Abstract

We consider the problem of developing robust algorithms which cope with noisy data. In the
Probably Approximately Correct model of machine learning, we develop a general technique
which allows nearly all PAC learning algorithms to be converted into highly efficient PAC
learning algorithms which tolerate noise. In the field of combinatorial algorithms, we develop
techniques for constructing search algorithms which tolerate linearly bounded errors and
probabilistic errors.

In the field of machine learning, we derive general bounds on the complexity of learning in
the recently introduced Statistical Query model and in the PAC model with noise. We do so
by considering the problem of improving the accuracy of learning algorithms. In particular, we
study the problem of "boosting" the accuracy of "weak" learning algorithms which fall within
the Statistical Query model, and we show that it is possible to improve the accuracy of such
learning algorithms to any arbitrary accuracy. We derive a number of interesting consequences
from this result, and in particular, we show that nearly all PAC learning algorithms can be
converted into highly efficient PAC learning algorithms which tolerate classification noise and
malicious errors.

We also investigate the longstanding problem of searching in the presence of errors. We
consider the problem of determining an unknown quantity x by asking "yes-no" questions,
where some of the answers may be erroneous. We focus on two different models of error:
the linearly bounded model, where for some known constant r < 1/2, each initial sequence of
i answers is guaranteed to have no more than ri errors, and the probabilistic model, where
errors occur randomly and independently with probability p < 1/2. We develop highly efficient
algorithms for searching in the presence of linearly bounded errors, and we further show that
searching in the presence of probabilistic errors can be efficiently reduced to searching in the
presence of linearly bounded errors.

Thesis Supervisor: Ronald L. Rivest 
Title: Professor of Computer Science 
Chapter 1



Introduction 



This thesis is concerned with the problems of concept learning and searching, and in particular,
it deals with the problem of coping with faulty data in each of these settings. In the sections
that follow, we informally introduce the topics of concept learning and searching.

1.1 Concept Learning 

A concept is simply a rule which divides objects into two categories: positive examples and
negative examples. The concept "circular region of radius 1 centered at the origin," for instance,
divides all points in the plane into positive and negative examples. All points in the plane
whose Euclidean distance to the origin is at most 1 are positive examples of this concept, while
all points in the plane whose Euclidean distance to the origin is greater than 1 are negative
examples. Concept learning is the problem of inferring a rule from a set of positive and negative
examples of that rule.
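
As a small illustration of the unit-disk concept described above, the following Python sketch
labels points in the plane. The function name and the sample points are our own choices for
illustration and do not come from the thesis.

```python
import math

def unit_disk_concept(point):
    """Return True (positive example) iff the point lies within distance 1 of the origin."""
    x, y = point
    return math.hypot(x, y) <= 1.0

# A few labelled examples of the concept (points chosen arbitrarily).
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
labelled_examples = [(p, unit_disk_concept(p)) for p in points]
```

A concept learner would see only such labelled pairs and would have to infer the rule itself.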

Devising machines, algorithms and programs which can learn concepts from positive and
negative examples is an important goal of artificial intelligence, and it has motivated much
research in the field. In this thesis, we approach the problem of concept learning from the
perspective of computational learning theory [1] in that we are concerned with the ability to
learn concepts efficiently. A machine or algorithm for concept learning is said to be efficient if
the quantities of resources it uses (e.g. time, space, examples, etc.) are bounded by polynomials
in the various learning parameters.

Many models of concept learning have been developed, and we focus on one such model, the
Probably Approximately Correct (PAC) model introduced by Valiant [34]. In the PAC model
of learning, the unknown concept to be learned is assumed to be a member of a known concept
class, and our goal is to develop a general algorithm which can learn any concept in the known
class from examples of that concept. In the above example, for instance, the known concept
class may be "circular regions in the plane," and the unknown target concept is "circular region
of radius 1 centered at the origin."

A great many algorithms have been devised to "PAC-learn" various concept classes [1];
however, nearly all of these algorithms are "brittle" in the sense that they cannot tolerate noisy
data. In Part I of this thesis, a model of PAC learning is studied which is both general in the
sense that it encompasses nearly all known PAC learning algorithms and robust in the sense that
many types of noise can be cleanly and efficiently accommodated. We extend and improve this
model of learning in ways which both increase the power of the algorithm designer and decrease
the complexity of the resultant robust learning algorithms. A more formal introduction to this
topic and to the corresponding results contained within this thesis can be found in Chapter 2.

1.2 Searching 

The problem of search is a fundamental one in computer science, and its importance need hardly
be justified. Perhaps the simplest example of search is embodied in the problem of finding a
particular element within a sorted collection of elements. The standard binary search algorithm
can be used to optimally solve this problem wherein all of the requisite queries are comparison
questions. This, however, is but one type of search. Consider also, for example, the problem
of medical diagnosis. A patient is suffering from some unknown disease, and a doctor has at
his disposal a number of tests, each of which rules out some number of possible maladies. By
performing some number of these tests, the doctor can, in principle, eliminate all possibilities
until only one remains. This too is an example of search wherein the "elements" (possible
diseases) are not ordered and the queries themselves correspond to subset questions. Instances
of search are pervasive, and we provide these two examples merely for illustrative purposes.

The problem of searching in an optimal fashion is well understood, but not if one were to
allow "faulty" data. For example, consider the aforementioned medical diagnosis example, and
suppose that some tests were known to occasionally be false-positive or false-negative. It is
problems such as this which motivate the work in Part II of this thesis wherein various models
of searching in the presence of faulty data are studied and robust algorithms for performing
search are developed. A more formal introduction to this topic and to the corresponding results
contained within this thesis can be found in Chapter 7.
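
To make the difficulty concrete, the following Python sketch runs ordinary binary search with
comparison questions of the form "is x < m?" and shows that a single erroneous answer is
enough to make it return the wrong element. It is only an illustration of why faulty answers
matter, not one of the robust algorithms of Part II; the helper names `search`, `truthful` and
`lies_on_first` are hypothetical.

```python
def search(n, answer):
    """Locate an unknown x in range(n) by asking 'is x < m?' via answer(m)."""
    lo, hi = 0, n          # if every answer is truthful, lo <= x < hi always holds
    questions = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        questions += 1
        if answer(mid):    # "yes": x < mid
            hi = mid
        else:              # "no": x >= mid
            lo = mid
    return lo, questions

def truthful(x):
    """Responder that always answers correctly."""
    return lambda m: x < m

def lies_on_first(x):
    """Responder that answers the first question incorrectly and all others correctly."""
    state = {"first": True}
    def answer(m):
        truth = x < m
        if state["first"]:
            state["first"] = False
            return not truth
        return truth
    return answer
```

With truthful answers, `search(16, truthful(5))` uses four questions and returns 5. With
`lies_on_first(5)`, the first answer discards the half of the range containing 5, and the
search cannot recover.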



Part I 



Learning in the Presence of Noise 
Chapter 2



Introduction 



The statistical query model of learning was created so that algorithm designers could construct
noise-tolerant PAC learning algorithms in a natural way. Ideally, such a model of robust learning
should restrict the algorithm designer as little as possible while maintaining the property that
these new learning algorithms can be efficiently simulated in the PAC model with noise. In
the following chapters, we both extend and improve the current statistical query model in
ways which both increase the power of the algorithm designer and decrease the complexity
of simulating these new learning algorithms. In this chapter, we summarize our results and
introduce the various models of learning required for the exposition that follows.

2.1 Introduction 

Since Valiant's introduction of the Probably Approximately Correct model of learning [34],
PAC learning has proven to be an interesting and well studied model of machine learning. In
an instance of PAC learning, a learner is given the task of determining a close approximation
of an unknown {0,1}-valued target function f from labelled examples of that function. The
learner is given access to an example oracle and accuracy and confidence parameters. When
polled, the oracle draws an instance according to a distribution D and returns the instance
along with its label according to f. The error rate of an hypothesis output by the learner is the
probability that an instance chosen according to D will be mislabelled by the hypothesis. The
learner is required to output an hypothesis such that, with high confidence, the error rate of the
hypothesis is less than the accuracy parameter. Two standard complexity measures studied in
the PAC model are sample complexity and time complexity. Efficient PAC learning algorithms
have been developed for many function classes [1], and PAC learning continues to be a popular
model of machine learning.

The model of learning described above is often referred to as the strong learning model since
a learning algorithm may be required to output an arbitrarily accurate hypothesis depending on
the accuracy parameter supplied. An interesting variant referred to as the weak learning model is
identical, except that there is no accuracy parameter and the output hypothesis need only have
error rate slightly less than 1/2. In other words, the output of a weak learning algorithm need
only perform slightly better than random guessing. A fundamental and surprising result first
shown by Schapire [28, 29] and later improved upon by Freund [14, 15] states that any algorithm
which efficiently weakly learns can be transformed into an algorithm which efficiently strongly
learns. These results have important consequences for PAC learning, including providing upper
bounds on the time and sample complexities of strong learning.

One criticism of the PAC model is that the data presented to the learner is assumed to
be noise-free. In fact, most of the standard PAC learning algorithms would fail if even a
small number of the labelled examples given to the learning algorithm were "noisy." Two
popular noise models for both theoretical and experimental research are the classification noise
model introduced by Angluin and Laird [2, 21] and the malicious error model introduced by
Valiant [35] and further studied by Kearns and Li [20]. In the classification noise model, each
example received by the learner is mislabelled randomly and independently with some fixed
probability. In the malicious error model, an adversary is allowed, with some fixed probability,
to substitute a labelled example of his choosing for the labelled example the learner would
ordinarily see.

While a limited number of efficient PAC algorithms had been developed which tolerate
classification noise [2, 16, 26], no general framework for efficient learning¹ in the presence of
classification noise was known until Kearns introduced the Statistical Query model [19].



¹ Angluin and Laird [2] introduced a general framework for learning in the presence of classification noise.
However, their methods do not yield computationally efficient algorithms in most cases.

In the SQ model, the example oracle of the standard PAC model is replaced by a statistics
oracle. An SQ algorithm queries this new oracle for the values of various statistics on the
distribution of labelled examples, and the oracle returns the requested statistics to within some
specified additive error. Upon gathering a sufficient number of statistics, the SQ algorithm
returns an hypothesis of the desired accuracy. Since calls to a statistics oracle can be simulated
with high probability by drawing a sufficiently large sample from an example oracle, one can
view this new oracle as an intermediary which effectively limits the way in which a learning
algorithm can make use of labelled examples. Two standard complexity measures of SQ algorithms
are query complexity, the maximum number of statistics required, and tolerance, the minimum
additive error required. The time and sample complexities of simulating SQ algorithms in the
PAC model are directly affected by these measures; therefore, we would like to bound these
measures as closely as possible.

Kearns [19] has demonstrated two important properties of the SQ model which make it
worthy of study. First, he has shown that nearly every PAC learning algorithm can be cast within
the SQ model, thus demonstrating that the SQ model is quite general and imposes a rather
weak restriction on learning algorithms. Second, he has shown that calls to a statistics oracle
can be simulated, with high probability, by a procedure which draws a sufficiently large sample
from a classification noise oracle. An immediate consequence of these two properties is that
nearly every PAC learning algorithm can be transformed into one which tolerates classification
noise.

Decatur [9] has further demonstrated that calls to a statistics oracle can be simulated, with
high probability, by a procedure which draws a sufficiently large sample from a malicious error
oracle. Thus, nearly every PAC learning algorithm can be transformed into one which tolerates
malicious errors. While Kearns and Li [20] had previously demonstrated a general technique
for converting a PAC learning algorithm into one which tolerates small amounts of malicious
error, the results obtained by appealing to SQ are better in some interesting cases [9].

While greatly expanding the function classes known to be learnable in the presence of noise,
Kearns' technique does not constitute a formal reduction from PAC learning to SQ learning.
In fact, such a reduction cannot exist: while the class of parity functions is known to be PAC
learnable [17], Kearns has shown that this class is provably unlearnable in the SQ model.
Kearns' technique for converting PAC algorithms to SQ algorithms consists of a few general
rules, but each PAC algorithm must be examined in turn and converted to an SQ algorithm
individually. Thus, one cannot derive general upper bounds on the complexity of SQ learning
from upper bounds on the complexity of PAC learning, due to the dependence on the specific
conversion of a PAC algorithm to an SQ algorithm. A consequence of this fact is that general
upper bounds on the time and sample complexities of PAC learning in the presence of noise are
not directly obtainable either.

We obtain bounds for SQ learning and PAC learning in the presence of noise by making
use of the following result. We define weak SQ learning in a manner analogous to weak PAC
learning, and we show that it is possible to boost the accuracy of weak SQ algorithms to obtain
strong SQ algorithms. Thus, we show that weak SQ learning is equivalent to strong SQ learning.
We use the technique of "boosting by majority" [15] which is nearly optimal in terms of its
dependence on the accuracy parameter $\epsilon$.

In the SQ model, as in the PAC model, this boosting result allows us to derive general
upper bounds on many complexity measures of learning. Specifically, we derive simultaneous
upper bounds with respect to $\epsilon$ on the number of queries, $O(\log^2 \frac{1}{\epsilon})$, the Vapnik-Chervonenkis
dimension of the query space, $O(\log \frac{1}{\epsilon} \log\log \frac{1}{\epsilon})$, and the inverse of the minimum tolerance,
$O(\frac{1}{\epsilon} \log \frac{1}{\epsilon})$. In addition, we show that these general upper bounds are nearly optimal by
describing a class of learning problems for which we simultaneously lower bound the number
of queries by $\Omega(\frac{d}{\log d} \log \frac{1}{\epsilon})$ and the inverse of the minimum tolerance by $\Omega(\frac{1}{\epsilon})$. Here d is the
Vapnik-Chervonenkis dimension of the function class to be learned.

The complexity of a statistical query algorithm in conjunction with the complexity of
simulating SQ algorithms in the various noise models determine the complexity of the noise-tolerant
PAC learning algorithms obtained. Kearns [19] has derived general bounds on the minimum
complexity of SQ algorithms, and we derive some specific lower bounds as well. Our boosting
result provides a general technique for constructing SQ algorithms which are nearly optimal
with respect to these bounds. However, the robust PAC learning algorithms obtained by
simulating even optimal SQ algorithms in the presence of noise are inefficient when compared to
known lower bounds for PAC learning in the presence of noise [11, 20, 30]. In fact, the PAC
learning algorithms obtained by simulating optimal SQ algorithms in the absence of noise are
inefficient when compared to the tight bounds known for noise-free PAC learning [7, 11]. These
shortcomings could be consequences of either inefficient simulations or a deficiency in the model
itself. In this thesis, we show that both of these explanations are true, and we provide both
new simulations and a variant of the SQ model which combat the current inefficiencies of PAC
learning via the statistical query model.

We improve the complexity of simulating SQ algorithms in the presence of classification
noise by providing a more efficient simulation. If $\tau_*$ is a lower bound on the minimum additive
error requested by an SQ algorithm and $\eta_b < 1/2$ is an upper bound on the unknown noise
rate, then Kearns' original simulation essentially runs $\Theta(\frac{1}{\tau_* (1 - 2\eta_b)^2})$ different copies of the SQ
algorithm and processes the results of these runs to obtain an output. We show that this
"branching factor" can be reduced to $O(\frac{1}{\tau_*} \log \frac{1}{1 - 2\eta_b})$, thus reducing the time complexity of the
simulation. We also provide a new and simpler proof that statistical queries can be estimated in
the presence of classification noise, and we show that our formulation can easily be generalized
to accommodate a strictly larger class of statistical queries.

We improve the complexity of simulating SQ algorithms in the absence of noise and in the
presence of malicious errors by proposing a natural variant of the SQ model and providing
efficient simulations for this variant. In the relative error SQ model, we allow SQ algorithms to
submit statistical queries whose estimates are required within some specified relative error. We
show that a class is learnable with relative error statistical queries if and only if it is learnable
with (standard) additive error statistical queries. Thus, known learnability and hardness results
for statistical query learning [6, 19] also hold in this variant.

We demonstrate general bounds on the complexity of relative error SQ learning, and we
show that many learning algorithms can naturally be written as highly efficient, relative error
SQ algorithms. We further provide simulations of relative error SQ algorithms in both the
absence and presence of noise. These simulations in the absence of noise and in the presence
of malicious errors are more efficient than the simulations of additive error SQ algorithms, and
given a roughly optimal relative error SQ algorithm, these simulations yield roughly optimal
PAC learning algorithms. These results hold for all function classes which are SQ learnable.

Finally, we show that our simulations of SQ algorithms in the absence of noise, in the
presence of classification noise, and in the presence of malicious errors can all be modified to
accommodate a strictly larger class of statistical queries. In particular, we show that our
simulations can accommodate real-valued statistical queries. Real-valued queries allow an algorithm
to query the expected value of a real-valued function of labelled examples. Our results on
improved simulations hold for this generalization in both the absence and presence of noise.

The remainder of this work is organized as follows. In Section 2.2, we formally define the
learning models of interest, and in Section 2.3, we describe PAC model boosting results which
are used in later chapters. In Chapters 3 and 4, we present our additive error and relative error
SQ model results, respectively. In Chapter 5, we present some extensions of our results, and
we conclude with a discussion of some open questions in Chapter 6.

2.2 Learning Models 

In this section, we formally define the relevant models of learning necessary for the exposition
that follows. We begin by defining the weak and strong PAC learning models, followed by the
classification noise and malicious error models, and finally the statistical query model.

2.2.1 The Weak and Strong PAC Learning Models 

In an instance of PAC learning, a learner is given the task of determining a close approximation
of an unknown {0, 1}-valued target function from labelled examples of that function. The
unknown target function f is assumed to be an element of a known function class $\mathcal{F}$ defined
over an instance space X. The instance space X is typically either the Boolean hypercube
$\{0,1\}^n$ or n-dimensional Euclidean space $\mathbb{R}^n$. We use the parameter n to denote the common
length of each instance $x \in X$.

We assume that the instances are distributed according to some unknown probability
distribution D on X. The learner is given access to an example oracle EX(f,D) as its source of
data. A call to EX(f,D) returns a labelled example $(x, \ell)$ where the instance $x \in X$ is drawn
randomly and independently according to the unknown distribution D, and the label $\ell$ is equal
to f(x). We often refer to a sequence of labelled examples drawn from an example oracle as a
sample.

A learning algorithm draws a sample from EX(f,D) and eventually outputs an hypothesis
h from some hypothesis class $\mathcal{H}$ defined over X. For any hypothesis h, the error rate of h is
defined to be the probability that h mislabels an instance drawn randomly according to D.
By using the notation $\Pr_D[P(x)]$ to denote the probability that a predicate P is satisfied by
an instance drawn randomly according to D, we may define $error(h) = \Pr_D[h(x) \neq f(x)]$. We
often think of $\mathcal{H}$ as a class of representations of functions in $\mathcal{F}$, and as such we define size(f)
to be the size of the smallest representation in $\mathcal{H}$ of the target function f.
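
The example oracle and the error rate defined above can be sketched in Python as follows.
This is a minimal illustration under our own assumptions: the distribution D is taken to be
uniform on [0, 1], the target and hypothesis are simple thresholds, and the names
`make_example_oracle` and `empirical_error` are hypothetical. The empirical error on a
sample only estimates error(h).

```python
import random

def make_example_oracle(f, draw_instance, rng):
    """Return a zero-argument function playing the role of EX(f, D)."""
    def ex():
        x = draw_instance(rng)   # instance drawn according to D
        return x, f(x)           # labelled example (x, f(x))
    return ex

def empirical_error(h, sample):
    """Fraction of labelled examples in the sample that h mislabels."""
    return sum(1 for x, label in sample if h(x) != label) / len(sample)

rng = random.Random(0)
target = lambda x: 1 if x >= 0.5 else 0       # assumed target function f
hypothesis = lambda x: 1 if x >= 0.6 else 0   # assumed hypothesis h
ex = make_example_oracle(target, lambda r: r.random(), rng)
sample = [ex() for _ in range(20000)]
estimate = empirical_error(hypothesis, sample)
```

Under the uniform distribution, h and f disagree exactly on [0.5, 0.6), so error(h) = 0.1 and
`estimate` should be close to that value.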

The learner's goal is to output, with probability at least $1 - \delta$, an hypothesis h whose error
rate is at most $\epsilon$, for the given accuracy parameter $\epsilon$ and confidence parameter $\delta$. A learning
algorithm is said to be polynomially efficient if its running time is polynomial in $1/\epsilon$, $1/\delta$, n
and size(f). We formally define PAC learning as follows (adapted from Kearns [19]):

Definition 1 (Strong PAC Learning) 

Let $\mathcal{F}$ and $\mathcal{H}$ be function classes defined over X. The class $\mathcal{F}$ is said to be polynomially learnable
by $\mathcal{H}$ if there exists a learning algorithm A and a polynomial $p(\cdot,\cdot,\cdot,\cdot)$ such that for any $f \in \mathcal{F}$,
for any distribution D on X, for any accuracy parameter $\epsilon$, $0 < \epsilon < 1$, and for any confidence
parameter $\delta$, $0 < \delta < 1$, the following holds: if A is given inputs $\epsilon$ and $\delta$, and access to an
example oracle EX(f,D), then A halts in time bounded by $p(1/\epsilon, 1/\delta, n, size(f))$ and outputs
an hypothesis $h \in \mathcal{H}$ that with probability at least $1 - \delta$ satisfies $error(h) \le \epsilon$.

As stated, this is often referred to as strong learning since the learning algorithm may be
required to output an arbitrarily accurate hypothesis depending on the input parameter $\epsilon$. A
variant of strong learning called weak learning is identical, except that there is no accuracy
parameter $\epsilon$ and the output hypothesis need only have error rate slightly less than 1/2, i.e.
$error(h) \le \frac{1}{2} - \gamma = \frac{1}{2} - \frac{1}{p(n, size(f))}$ for some polynomial p. Since random guessing would produce
an error rate of 1/2, one can view the output of a weak learning algorithm as an hypothesis
whose error rate is slightly better than random guessing. We refer to the output of a weak
learning algorithm as a weak hypothesis and the output of a strong learning algorithm as a
strong hypothesis.

2.2.2 The Classification Noise and Malicious Error Models 

One criticism of the PAC model is that the data presented to the learner is required to be
noise-free. Two popular models of noise for both experimental and theoretical purposes are
the classification noise model introduced by Angluin and Laird [2, 21] and the malicious error
model introduced by Valiant [35].

The Classification Noise Model 

In the classification noise model, the example oracle EX(f,D) is replaced by a noisy example
oracle $EX^{\eta}_{CN}(f,D)$. Each time this noisy example oracle is called, an instance $x \in X$ is drawn
according to D. The oracle then outputs (x, f(x)) with probability $1 - \eta$ or $(x, \neg f(x))$ with
probability $\eta$, randomly and independently for each instance drawn. Despite the noise in the
labelled examples, the learner's goal remains to output an hypothesis h which, with probability
at least $1 - \delta$, has error rate $error(h) = \Pr_D[h(x) \neq f(x)]$ at most $\epsilon$.

While the learner does not typically know the exact value of the noise rate $\eta$, the learner
is given an upper bound $\eta_b$ on the noise rate, $0 \le \eta \le \eta_b < 1/2$, and the learner is said to be
polynomially efficient if its running time is polynomial in the usual PAC learning parameters
as well as $\frac{1}{1 - 2\eta_b}$.
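
The following Python sketch illustrates the classification noise oracle and hints at why a
factor of 1/(1 - 2η_b) appears in the efficiency requirement. It relies on a standard identity
that is not stated in the text above: since each label is flipped independently with
probability η, the probability that h disagrees with a noisy label is η + (1 - 2η)·error(h).
For simplicity the sketch assumes the exact noise rate η is known, whereas the learner in this
model only knows the bound η_b; all function names are hypothetical.

```python
import random

def make_noisy_oracle(f, draw_instance, eta, rng):
    """Return a zero-argument function playing the role of the classification noise oracle."""
    def ex():
        x = draw_instance(rng)
        label = f(x)
        if rng.random() < eta:   # flip the label with probability eta
            label = 1 - label
        return x, label
    return ex

def correct_disagreement(noisy_rate, eta):
    """Invert noisy_rate = eta + (1 - 2*eta) * error to recover the true error rate."""
    return (noisy_rate - eta) / (1.0 - 2.0 * eta)

rng = random.Random(0)
eta = 0.2                                     # assumed (known) noise rate
target = lambda x: 1 if x >= 0.5 else 0       # assumed target function f
hypothesis = lambda x: 1 if x >= 0.6 else 0   # assumed hypothesis h, true error 0.1
ex = make_noisy_oracle(target, lambda r: r.random(), eta, rng)
sample = [ex() for _ in range(20000)]
noisy_rate = sum(1 for x, label in sample if hypothesis(x) != label) / len(sample)
recovered = correct_disagreement(noisy_rate, eta)
```

Dividing by 1 - 2η magnifies any sampling error in `noisy_rate`, so estimates must be more
precise, and samples larger, as the noise rate approaches 1/2.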

The Malicious Error Model 

In the malicious error model, the example oracle EX(f,D) is replaced by a noisy example
oracle $EX^{\beta}_{MAL}(f,D)$. When a labelled example is requested from this oracle, with probability
$1 - \beta$, an instance x is chosen according to D and (x, f(x)) is returned to the learner. With
probability $\beta$, a malicious adversary selects any instance $x \in X$, selects a label $\ell \in \{0,1\}$, and
returns $(x, \ell)$. Again, the learner's goal is to output an hypothesis h which, with probability at
least $1 - \delta$, has error rate $error(h) = \Pr_D[h(x) \neq f(x)]$ at most $\epsilon$.
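
The malicious error oracle can be sketched in the same style. In this illustration the
adversary is modelled as an arbitrary zero-argument function returning a labelled example of
its choosing; the names `make_malicious_oracle` and `adversary` are hypothetical, and a real
adversary may of course adapt its choices to the learner.

```python
import random

def make_malicious_oracle(f, draw_instance, beta, adversary, rng):
    """Return a zero-argument function playing the role of the malicious error oracle."""
    def ex():
        if rng.random() < beta:
            return adversary()   # adversary supplies any (x, label) pair it likes
        x = draw_instance(rng)
        return x, f(x)           # otherwise a correctly labelled example
    return ex

target = lambda x: 1 if x >= 0.5 else 0   # assumed target function f
adversary = lambda: (0.9, 0)              # assumed adversary: always a mislabelled point

clean = make_malicious_oracle(target, lambda r: r.random(), 0.0, adversary, random.Random(0))
corrupt = make_malicious_oracle(target, lambda r: r.random(), 1.0, adversary, random.Random(0))
clean_sample = [clean() for _ in range(1000)]
corrupt_sample = [corrupt() for _ in range(1000)]
```

With β = 0 the oracle behaves exactly like EX(f,D), and with β = 1 every example comes from
the adversary.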

2.2.3 The Statistical Query Model 

In the statistical query model, the example oracle EX(f,D) from the standard PAC model is
replaced by a statistics oracle STAT(f,D). An SQ algorithm queries the STAT oracle for the
values of various statistics on the distribution of labelled examples (e.g. "What is the probability
that a randomly chosen labelled example $(x, \ell)$ has variable $x_i = 0$ and $\ell = 1$?"), and the STAT
oracle returns the requested statistics to within some specified additive error. Formally, a
statistical query is of the form $[\chi, \tau]$. Here $\chi$ is a mapping from labelled examples to {0, 1} (i.e.
$\chi : X \times \{0,1\} \to \{0,1\}$) corresponding to an indicator function for those labelled examples
about which statistics are to be gathered, while $\tau$ is an additive error parameter. A call $[\chi, \tau]$
to STAT(f,D) returns an estimate $\hat{P}_\chi$ of $P_\chi = \Pr_D[\chi(x, f(x))]$ which satisfies $|\hat{P}_\chi - P_\chi| \le \tau$.

A call to STAT(f,D) can be simulated, with high probability, by drawing a sufficiently
large sample from EX(f,D) and outputting the fraction of labelled examples which satisfy
$\chi(x, f(x))$ as the estimate $\hat{P}_\chi$. Since the required sample size depends polynomially on $1/\tau$ and
the simulation time additionally depends on the time required to evaluate $\chi$, an SQ learning
algorithm is said to be polynomially efficient if $1/\tau$, the time required to evaluate each $\chi$, and
the running time of the SQ algorithm are all bounded by polynomials in $1/\epsilon$, n and size(f). We
formally define polynomially efficient learning in the statistical query model as follows (adapted
from Kearns [19]):

Definition 2 (Strong SQ Learning) 

Let $\mathcal{F}$ and $\mathcal{H}$ be function classes defined over X. The class $\mathcal{F}$ is said to be polynomially
learnable via statistical queries by $\mathcal{H}$ if there exists a learning algorithm A and polynomials
$p_1(\cdot,\cdot,\cdot)$, $p_2(\cdot,\cdot,\cdot)$, and $p_3(\cdot,\cdot,\cdot)$ such that for any $f \in \mathcal{F}$, for any distribution D on X, and for
any error parameter $\epsilon$, $0 < \epsilon < 1$, the following holds: if A is given input $\epsilon$ and access to a
statistics oracle STAT(f,D), then (1) for every query $[\chi, \tau]$ made by A, $\chi$ can be evaluated in
time bounded by $p_1(1/\epsilon, n, size(f))$ and $1/\tau$ is bounded by $p_2(1/\epsilon, n, size(f))$, and (2) A halts in
time bounded by $p_3(1/\epsilon, n, size(f))$ and outputs an hypothesis $h \in \mathcal{H}$ that satisfies $error(h) \le \epsilon$.
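
The simulation of a single STAT(f,D) call by sampling, described before Definition 2, can be
sketched in Python as follows. The sample size used here comes from Hoeffding's inequality,
which guarantees an additive error of at most τ with probability at least 1 - δ once
m ≥ ln(2/δ)/(2τ²). This particular bound is our assumption for the sketch and is not
necessarily the one used in the thesis; the function names are hypothetical.

```python
import math
import random

def hoeffding_sample_size(tau, delta):
    """Number of examples sufficient for additive error tau with confidence 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * tau * tau))

def simulate_stat(chi, tau, delta, ex):
    """Estimate P_chi as the fraction of freshly drawn examples satisfying chi."""
    m = hoeffding_sample_size(tau, delta)
    hits = sum(1 for _ in range(m) if chi(*ex()))   # ex() returns (x, label)
    return hits / m

rng = random.Random(1)
target = lambda x: 1 if x >= 0.5 else 0   # assumed target function f

def ex():
    """Assumed example oracle: D is uniform on [0, 1]."""
    x = rng.random()
    return x, target(x)

# Query: "what is the probability that a random example is labelled 1?" (true value 0.5)
estimate = simulate_stat(lambda x, label: label == 1, tau=0.05, delta=0.01, ex=ex)
```

The quadratic dependence of the sample size on 1/τ is why the tolerance of an SQ algorithm
directly affects the cost of simulating it.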

For an SQ algorithm A, we may further define its query complexity and tolerance. In a given
instance of learning, the query complexity of A is the number of queries submitted by A, and
the tolerance of A is the smallest additive error requested by A. We let $N_* = N_*(\epsilon, n, size(f))$
be an upper bound on the query complexity of A, and we let $\tau_* = \tau_*(\epsilon, n, size(f))$ be a lower
bound on the tolerance of A. Note that $N_* \le p_3(1/\epsilon, n, size(f))$ and $\tau_* \ge 1/p_2(1/\epsilon, n, size(f))$.
Since calls to a statistics oracle can be simulated by a procedure which draws a sample from
an example oracle, one can view the statistical query model as simply restricting the way in
which PAC learning algorithms can make use of labelled examples. Kearns has shown that this
restriction is rather weak in that nearly every PAC learning algorithm can be cast in the SQ
model.

An important property of this model is that calls to a statistics oracle can also be simulated,
with high probability, by a procedure which draws a sample from a classification noise oracle
$EX^{\eta}_{CN}(f,D)$ [19] or a malicious error oracle $EX^{\beta}_{MAL}(f,D)$ [9]. In the former case, the sample
size required is polynomial in $1/\tau$, $1/(1 - 2\eta_b)$ and $\log(1/\delta)$; in the latter case, the sample
size required is polynomial in $1/\tau$ and $\log(1/\delta)$. While a reasonably efficient simulation of
an SQ algorithm can be obtained by drawing a separate sample for each call to the statistics
oracle, better bounds on the sample complexity of the simulation are obtained by drawing one
large sample and estimating each statistical query using that single sample. If we let Q be the
function space from which an SQ algorithm A selects its queries, then the size of the single
sample required is independent of the query complexity of A but depends on either the size of
Q or the Vapnik-Chervonenkis dimension² of Q. Q is referred to as the query space of the SQ
algorithm A.
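
The benefit of a single shared sample can be illustrated for a finite query space in the
noise-free setting. The sketch below uses the standard combination of Hoeffding's inequality
with a union bound: one sample of size ln(2|Q|/δ)/(2τ²) suffices to estimate every query in Q
to within τ simultaneously, whereas drawing a fresh sample for each of N queries, each with
confidence δ/N, costs N times the per-query sample size. These formulas are our assumptions
for illustration and are not the bounds derived in the thesis; the function names are
hypothetical.

```python
import math

def single_sample_size(query_space_size, tau, delta):
    """One shared sample that is accurate for all queries in a finite query space Q."""
    return math.ceil(math.log(2.0 * query_space_size / delta) / (2.0 * tau * tau))

def fresh_samples_total(num_queries, tau, delta):
    """Total examples drawn when each of num_queries queries gets its own sample."""
    per_query = math.ceil(math.log(2.0 * num_queries / delta) / (2.0 * tau * tau))
    return num_queries * per_query

shared = single_sample_size(100, tau=0.1, delta=0.05)     # |Q| = 100
separate = fresh_samples_total(100, tau=0.1, delta=0.05)  # N = 100 queries
```

In this example the shared sample has 415 examples while separate samples use 41500 in total,
and the shared sample size does not grow if the algorithm asks more queries from the same Q.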

Kearns has shown that an SQ algorithm can be simulated in the classification noise model
using a sample size which depends on Q, $\tau_*$, $\delta$, $\epsilon$ and $\eta_b$. Decatur has shown that an SQ
algorithm can be simulated in the malicious error model using a sample size which depends on
Q, $\tau_*$, and $\delta$. The amount of malicious error which can be tolerated by the latter simulation
depends on $\tau_*$. Given that nearly every PAC learning algorithm can be converted to an SQ
algorithm, an immediate consequence of these results is that nearly every PAC algorithm can
be transformed into one which tolerates noise. The complexities of these noise-tolerant versions
depend on $\tau_*$ and Q, which themselves are a function of the ad hoc conversion of PAC algorithms
to SQ algorithms. Thus, one cannot show general upper bounds on the complexity of these
noise-tolerant versions of converted PAC algorithms.

We define weak SQ learning identically to strong SQ learning except that there is no accuracy
parameter $\epsilon$. In this case, the output hypothesis need only have error rate slightly less than 1/2,
i.e. $error(h) \le \frac{1}{2} - \gamma = \frac{1}{2} - \frac{1}{p(n, size(f))}$ for some polynomial p. By showing that weak SQ learning
algorithms can be "boosted" to strong SQ learning algorithms, we derive general lower bounds
on the tolerance of SQ learning and general upper bounds on the complexity of the requisite
query space. We are then able to show general upper bounds on the complexity of noise-tolerant
PAC learning via the statistical query model. These results are given in Chapters 3 and 4.



² VC-dimension is a standard complexity measure for a space of {0, 1}-valued functions.

2.3 Boosting in the PAC Model

In this section, we describe the PAC model boosting results on which our SQ model boosting
results are based.

Schapire [28F29] and Freund [14T15] use similar strategies for boosting weak learning al- 
gorithms to strong learning algorithms. They both create a strong hypothesis by combining 
many hypotheses obtained from multiple runs of a weak learning algorithm. The boosting 
schemes derive their power by essentially forcing the weak learning algorithmrin later runsr 
to approximate the target function / with respect to new distributions which "heavily" weight 
those instances that previous hypotheses misclassify. By suitably constructing example oracles 
corresponding to these new distributions and properly combining the hypotheses obtained from 
multiple runs of the weak learning algorithmFa strong learning algorithm can be produced 
which uses the weak learning algorithm as a subroutine. 

Freund has developed two similar methods (which we call Scheme 1 and Scheme 2) for 
boosting weak learning algorithms to strong learning algorithms. One is more efficient with 
respect to $\epsilon$ while the other is more efficient with respect to $\gamma$. Freund develops a hybrid
scheme more efficient than either Scheme 1 or Scheme 2 by combining these two methods in 
order to capitalize on the advantages of each. We first describe the two schemes separately and 
then show how to combine them. 

2.3.1 Boosting via Scheme 1 in the PAC Model 

Scheme 1 uses a weak learning algorithm to create a set of $k_1 = \frac{1}{2\gamma^2} \ln \frac{1}{\epsilon}$ weak hypotheses and
outputs the majority vote of these hypotheses as the strong hypothesis. The weak hypotheses
are created by asking the weak learner to approximate $f$ with respect to various modified
distributions over the instance space $X$. The distribution used to generate a given weak hypothesis
is based on the performance of the previously generated weak hypotheses. Hypothesis $h_1$ is
created in the usual way by using $EX(f, D)$. For all $i \ge 1$, hypothesis $h_{i+1}$ is created by giving
the weak learner access to a filtered example oracle $EX(f, D_{i+1})$ defined as follows:

1. Draw a labelled example $\langle x, f(x) \rangle$ from $EX(f, D)$.
2. Compute $h_1(x), \ldots, h_i(x)$.
3. Set $r$ to be the number of hypotheses which agree with $f$ on $x$.
4. Flip a biased coin with $\Pr[HEAD] = \alpha^i_r$.
5. If HEAD, then output example $\langle x, f(x) \rangle$; otherwise go to Step 1.



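When $k$ weak hypotheses are to be generated, the set of probabilities $\{\alpha^i_r\}$ is fixed according
to the following binomial distribution:

$$\alpha^i_r = \begin{cases} 0 & \text{if } r > \lfloor k/2 \rfloor \\ \binom{k-i-1}{\lfloor k/2 \rfloor - r} \left(\frac{1}{2} + \gamma\right)^{\lfloor k/2 \rfloor - r} \left(\frac{1}{2} - \gamma\right)^{\lceil k/2 \rceil - i - 1 + r} & \text{if } i - \lceil k/2 \rceil + 1 \le r \le \lfloor k/2 \rfloor \\ 0 & \text{if } r < i - \lceil k/2 \rceil + 1 \end{cases}$$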
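
The five-step filter and its coin bias can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `alpha`, `filtered_example` and `draw` are hypothetical, the weight is taken to be the binomial expression $\binom{k-i-1}{\lfloor k/2 \rfloor - r}(\frac{1}{2}+\gamma)^{\lfloor k/2 \rfloor - r}(\frac{1}{2}-\gamma)^{\lceil k/2 \rceil - i - 1 + r}$ (zero outside $i - \lceil k/2 \rceil + 1 \le r \le \lfloor k/2 \rfloor$), and hypotheses are modelled as plain callables.

```python
import math
import random

def alpha(k, i, r, gamma):
    """Coin bias alpha^i_r when i hypotheses exist and k are planned (assumed form)."""
    half_up = (k + 1) // 2            # ceil(k/2)
    half_down = k // 2                # floor(k/2)
    if r > half_down or r < i - half_up + 1:
        return 0.0
    s = half_down - r                 # exponent of the (1/2 + gamma) factor
    return (math.comb(k - i - 1, s)
            * (0.5 + gamma) ** s
            * (0.5 - gamma) ** (half_up - i - 1 + r))

def filtered_example(draw, hypotheses, k, gamma, rng=random):
    """Rejection-sample one example for EX(f, D_{i+1}) using draws from EX(f, D).

    `draw` is a hypothetical stand-in for EX(f, D): it returns a pair (x, f(x)).
    """
    i = len(hypotheses)
    while True:
        x, label = draw()                                  # Step 1
        r = sum(1 for h in hypotheses if h(x) == label)    # Steps 2 and 3
        if rng.random() < alpha(k, i, r, gamma):           # Step 4
            return x, label                                # Step 5
```

The loop makes the pitfall discussed next concrete: when the acceptance probability is small, many draws from `draw` are consumed before one example is returned.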

Freund shows that, with high probability, the majority vote of $h_1, \ldots, h_{k_1}$ has error rate no
more than $\epsilon$ with respect to $D$ if each $h_j$ has error rate no more than $\frac{1}{2} - \gamma$ with respect to $D_j$.
One pitfall of this scheme is that the simulation of $EX(f, D_{i+1})$ may need to draw many
examples from $EX(f, D)$ before one is output to the weak learner. Let $t_i$ be the probability
that an example drawn randomly from $EX(f, D)$ passes through the probabilistic filter which
defines $EX(f, D_{i+1})$. Freund observes that if $t_i < c\epsilon^2$ for some constant $c$, then the majority
vote of $h_1, \ldots, h_i$ is already a strong hypothesis. The boosting algorithm can estimate $t_i$, and if
$t_i$ is below the cutoff, the algorithm may halt and output the majority vote of the hypotheses
created thus far. The boosting algorithm's time and sample complexity dependence on $\gamma$ is
$\tilde{O}(1/\gamma^2)$, while its dependence on $\epsilon$ is $\tilde{O}(1/\epsilon^2)$.³

2.3.2 Boosting via Scheme 2 in the PAC Model 

Scheme 2 is very similar to Scheme 1. The weak learner is again called many times to provide
weak hypotheses with respect to filtered distributions. This method uses $k_2 = 2k_1 = \frac{1}{\gamma^2} \ln \frac{1}{\epsilon}$
weak hypotheses, while the filtered example oracle remains the same. The main difference is the
observation that if $t_i < \frac{\epsilon(1-\epsilon)\gamma}{\ln(1/\epsilon)}$, then we may simply use a "fair coin" in place of $h_{i+1}$ and still
be guaranteed, with high probability, that the final majority of $k_2$ hypotheses has error rate no
more than $\epsilon$.⁴ The boosting algorithm estimates $t_i$ to see if it is below this new threshold. If so,
a "fair coin" is used as hypothesis $h_{i+1}$, and the algorithm proceeds to find a weak hypothesis
with respect to the next distribution. The boosting algorithm's time and sample complexity
dependence on $\gamma$ is $\tilde{O}(1/\gamma^3)$, while its dependence on $\epsilon$ is $\tilde{O}(1/\epsilon)$.

3 For asymptotically growing functions $g$, $g > 1$, we define $\tilde{O}(g)$ to mean $O(g \log^c g)$ for some constant $c \ge 0$.
For asymptotically shrinking functions $g$, $0 < g < 1$, we define $\tilde{O}(g)$ to mean $O(g \log^c(1/g))$ for some constant
$c \ge 0$. We define $\tilde{\Omega}$ similarly for constants $c \le 0$. Finally, we define $\tilde{\Theta}$ to mean both $\tilde{O}$ and $\tilde{\Omega}$. This asymptotic
notation, read "soft-O," "soft-Omega," and "soft-Theta," is convenient for expressing bounds while ignoring
lower order factors. It is somewhat different than the standard soft-order notation.

2.3.3 Hybrid Boosting in the PAC Model 

An improvement on these two boosting schemes is realized by using each in the "boosting
range" for which it is most efficient. The first method is more efficient in $1/\gamma$, while the second
method is more efficient in $1/\epsilon$. We therefore use the first method to boost from $\frac{1}{2} - \gamma$ to a
constant, and we use the second method to boost from that constant to $\epsilon$. Let $A_{1/4}$ be a learning
algorithm which uses Scheme 1 and makes calls to the weak learning algorithm $A_{1/2-\gamma}$. The
strong learning algorithm $A_\epsilon$ uses Scheme 2 and makes calls to $A_{1/4}$ as its "weak learner." The
strong hypothesis output by such a hybrid algorithm is a depth two circuit with a majority
gate at the top level. The inputs to the top level are "fair coin" hypotheses and majority gates
whose inputs are weak hypotheses with respect to various distributions. The hybrid's time and
sample complexity dependence on $\gamma$ is $\tilde{O}(1/\gamma^2)$, while its dependence on $\epsilon$ is $\tilde{O}(1/\epsilon)$.



4 A "fair coin" hypothesis ignores its input x and outputs the outcome of a fair coin flip.



Chapter 3

Learning Results in the Additive Error SQ Model



In this chapter, we derive a number of results in the additive error statistical query model.
We begin by showing that it is possible to boost weak learning algorithms in the SQ model,
and from this we derive general bounds on learning in the SQ model. We then describe a
new method for simulating SQ algorithms in the PAC model with classification noise. Finally,
by combining the aforementioned results, we derive general bounds on PAC learning in the
presence of classification noise which apply to all function classes known to be SQ learnable.

3.1 Boosting in the Statistical Query Model 

Boosting is accomplished by forcing a weak learning algorithm to approximate the target
function $f$ with respect to modified distributions over the instance space. Specifically, the boosting
methods described in the previous chapter are based on the observation that, with high
probability, the majority vote of $h_1, \ldots, h_k$ has error rate at most $\epsilon$ with respect to $D$ if each constituent
$h_j$ has error rate at most $\frac{1}{2} - \gamma$ with respect to $D_j$. In the PAC model, a learner interacts with
the distribution over the instance space through calls to an example oracle. Therefore, boosting
in the PAC model is accomplished by constructing $EX(f, D_j)$ from the original example oracle
$EX(f, D)$. In the SQ model, a learner interacts with the distribution over labelled examples
through calls to a statistics oracle. Therefore, boosting in the SQ model is accomplished by
constructing $STAT(f, D_j)$ from the original statistics oracle $STAT(f, D)$.

In the sections that follow, we first show how to boost a weak SQ algorithm using either
Scheme 1 or Scheme 2. We then show how to boost a weak SQ algorithm using the hybrid
method. Although it is possible to boost in the SQ model using Schapire's method, we do not
describe these results since they are somewhat weaker than those presented here.

3.1.1 Boosting via Scheme 1 in the Statistical Query Model 

We can use Scheme 1 to boost weak SQ learning algorithms by simply answering statistical 
queries made with respect to modified distributions. Therefore, we must be able to simulate
queries to STAT(f,Dj) by making queries to STAT(f,D). We first show how to specify the 
exact value of a query with respect to Dj in terms of queries with respect to D. We then 
determine the accuracy with which we need to make these queries with respect to D in order 
to obtain a sufficient accuracy with respect to Dj. 

The modified distributions required for boosting are embodied in the five step description
of the filtered example oracle given in Section 2.3.1. Note that Steps 2 and 3 partition the
instance space into $i + 1$ regions corresponding to those instances which are correctly classified
by the same number of hypotheses. Let $X^i_r \subseteq X$ be the set of instances which are correctly
classified by exactly $r$ of the $i$ hypotheses. We define the induced distribution $D_Z$ on a set $Z$ with
respect to distribution $D$ as follows: For any $Y \subseteq Z$, $D_Z[Y] = D[Y]/D[Z]$. By construction,
for any given $X^i_r$ region, the filtered example oracle uniformly scales the probability with which
examples from that region are drawn. Therefore, the induced distribution on $X^i_r$ with respect
to $D_{i+1}$ is the same as the induced distribution on $X^i_r$ with respect to $D$. (This fact is used to
obtain Equation 3.2 from Equation 3.1 below.)

A query $[\chi, \tau]$ to $STAT(f, D_{i+1})$ is a call for an estimate of $\Pr_{D_{i+1}}[\chi(x, f(x))]$ within additive
error $\tau$. We derive an expression for $\Pr_{D_{i+1}}[\chi(x, f(x))]$ below.

$$\Pr_{D_{i+1}}[\chi(x, f(x))] = \sum_{r=0}^{i} \Pr_{D_{i+1}}[\chi(x, f(x)) \mid x \in X^i_r] \cdot \Pr_{D_{i+1}}[x \in X^i_r] \qquad (3.1)$$

$$= \sum_{r=0}^{i} \Pr_{D}[\chi(x, f(x)) \mid x \in X^i_r] \cdot \Pr_{D_{i+1}}[x \in X^i_r] \qquad (3.2)$$

$$= \sum_{r=0}^{i} \frac{\Pr_{D}[\chi(x, f(x)) \wedge (x \in X^i_r)]}{\Pr_{D}[x \in X^i_r]} \cdot \frac{\alpha^i_r \cdot \Pr_{D}[x \in X^i_r]}{\sum_{j=0}^{i} \alpha^i_j \cdot \Pr_{D}[x \in X^i_j]}$$

$$= \frac{\sum_{r=0}^{i} \alpha^i_r \cdot \Pr_{D}[\chi(x, f(x)) \wedge (x \in X^i_r)]}{\sum_{j=0}^{i} \alpha^i_j \cdot \Pr_{D}[x \in X^i_j]} \qquad (3.3)$$

Note that the denominator of Equation 3.3 is the probability that an example drawn randomly
from $EX(f, D)$ passes through the probabilistic filter which defines $EX(f, D_{i+1})$. Recall that
Freund calls this probability $t_i$.

Ignoring the additive error parameter for the moment, the probabilities in Equation 3.3 can
be stated as queries to $STAT(f, D)$ as follows:

$$STAT(f, D_{i+1})[\chi] = \frac{\sum_{r=0}^{i} \alpha^i_r \cdot STAT(f, D)[\chi \wedge \chi^i_r]}{\sum_{j=0}^{i} \alpha^i_j \cdot STAT(f, D)[\chi^i_j]} \qquad (3.4)$$

where $\chi^i_j(x, \ell)$ is true if and only if $x \in X^i_j$. Note that query $\chi^i_j$ is polynomially evaluatable
given $h_1, \ldots, h_i$, thus satisfying the efficiency condition given in the definition of SQ learning.
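
As an illustration of Equation 3.4, the following Python sketch evaluates a query against the filtered distribution using only queries against the original one. It is a toy under stated assumptions: the domain is finite, `stat` is a hypothetical exact oracle (no additive error), and the weights $\alpha^i_r$ are passed in as a list.

```python
from fractions import Fraction

def stat(D, f, chi):
    """Exact STAT(f, D)[chi] on a finite domain: Pr_D[chi(x, f(x))]."""
    return sum(p for x, p in D.items() if chi(x, f(x)))

def simulate_stat_next(D, f, hyps, alphas, chi):
    """Evaluate the ratio of Equation 3.4 using only queries to STAT(f, D)."""
    def region(j):
        # chi^i_j(x, l): true iff exactly j hypotheses agree with label l on x
        return lambda x, l: sum(h(x) == l for h in hyps) == j

    i = len(hyps)
    num = sum(alphas[r] * stat(D, f, lambda x, l, r=r: chi(x, l) and region(r)(x, l))
              for r in range(i + 1))
    den = sum(alphas[j] * stat(D, f, region(j)) for j in range(i + 1))
    return num / den
```

Each simulated query costs $2(i + 1)$ calls to `stat`, one numerator and one denominator query per region. With real oracles each call returns only an estimate, so the accuracy of the ratio has to be controlled separately.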
We next determine the accuracy with which we must ask these queries so that the final
result is within the desired additive error $\tau$. We make use of the following two claims.

Claim 1 If $0 \le a, b, c, \tau \le 1$ and $a = b/c$, then to obtain an estimate of $a$ within additive
error $\tau$, it is sufficient to obtain estimates of $b$ and $c$ within additive error $c\tau/3$.

Proof: We must show that $(b + c\tau/3)/(c - c\tau/3) \le a + \tau$ and $(b - c\tau/3)/(c + c\tau/3) \ge a - \tau$.
The claim is proven as follows.

$$\frac{b + c\tau/3}{c - c\tau/3} = \frac{a + \tau/3}{1 - \tau/3} = (a + \tau/3)\left(1 + \frac{\tau/3}{1 - \tau/3}\right) \le (a + \tau/3)(1 + \tau/2) = a + a\tau/2 + \tau/3 + \tau^2/6 \le a + \tau$$

$$\frac{b - c\tau/3}{c + c\tau/3} = \frac{a - \tau/3}{1 + \tau/3} = (a - \tau/3)\left(1 - \frac{\tau/3}{1 + \tau/3}\right) \ge (a - \tau/3)(1 - \tau/3) = a - a\tau/3 - \tau/3 + \tau^2/9 \ge a - \tau$$

□
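
A quick numeric sanity check of Claim 1 can be written as below. The function names are hypothetical, and the check only probes the two extreme cases used in the proof (numerator high with denominator low, and the reverse); it is not a substitute for the argument above.

```python
def ratio_bounds(b, c, tau):
    """Extreme values of the estimated ratio when b and c are each off by c*tau/3."""
    err = c * tau / 3
    return (b - err) / (c + err), (b + err) / (c - err)

def claim1_holds(b, c, tau, slack=1e-12):
    """Check that both extremes stay within additive error tau of a = b/c."""
    low, high = ratio_bounds(b, c, tau)
    a = b / c
    return a - tau - slack <= low and high <= a + tau + slack
```

The small `slack` only absorbs floating-point rounding; the bound is tight at $a = \tau = 1$, where the upper extreme equals $a + \tau$ exactly.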



Claim 2 If $0 \le s, p_i, z_i, \tau \le 1$, $\sum_i p_i \le 1$ and $s = \sum_i p_i z_i$, then to obtain an estimate of
$s$ within additive error $\tau$, it is sufficient to obtain estimates of each $z_i$ within additive error $\tau$
provided that the $p_i$ coefficients are known.

Proof: The claim follows immediately from the inequalities given below.

$$\sum_i p_i (z_i + \tau) = \sum_i p_i z_i + \tau \sum_i p_i \le s + \tau$$

$$\sum_i p_i (z_i - \tau) = \sum_i p_i z_i - \tau \sum_i p_i \ge s - \tau$$

□

Applying Claims 1 and 2 to Equation 3.4, we find that it is sufficient to submit queries
to $STAT(f, D)$ with additive error $t_i \cdot \tau/3$ in order to simulate a call to $STAT(f, D_{i+1})$ with
additive error $\tau$. There are two problems with this observation. First, if $t_i$ is small, then we
are forced to submit queries with small additive error. Second, the value $t_i$ is unknown, and in
fact, it is the value of the denominator we are attempting to estimate. We can overcome these
difficulties by employing the "abort" condition of Freund which allows us to either lower bound
$t_i$ or abort the search for $h_{i+1}$.

If $t_i < c\epsilon^2$, then the majority vote of the hypotheses generated thus far is a strong hypothesis.
Submit each query to $STAT(f, D)$ with additive error $\frac{c\epsilon^2}{2+3/\tau}$. Let $\hat{t}_i$ be the estimate for $t_i$
obtained, and note that by Claim 2, $\hat{t}_i$ is within additive error $\frac{c\epsilon^2}{2+3/\tau}$ of $t_i$. If $\hat{t}_i < c\epsilon^2 (1 - \frac{1}{2+3/\tau})$,
then $t_i < c\epsilon^2$. In this case, we may halt and output the majority vote of the hypotheses
created thus far. If $\hat{t}_i \ge c\epsilon^2 (1 - \frac{1}{2+3/\tau})$, then $t_i \ge c\epsilon^2 (1 - \frac{2}{2+3/\tau}) = c\epsilon^2 (\frac{3/\tau}{2+3/\tau})$. In this case, our
estimate $\hat{t}_i$ is sufficiently accurate since the additive error required by Claim 1 is $t_i \cdot \tau/3$, and
$t_i \cdot \tau/3 \ge c\epsilon^2 (\frac{3/\tau}{2+3/\tau}) \cdot \tau/3 = \frac{c\epsilon^2}{2+3/\tau}$, which is the additive error used. Given that the numerator
and denominator are both estimated with additive error at most $t_i \cdot \tau/3$, their ratio is within additive
error $\tau$ by Claim 1.

We can now bound the tolerance of strong SQ learning algorithms obtained by Scheme 1
boosting. If $\tau_0 = \tau_0(n, size(f))$ is a lower bound on the tolerance of a weak SQ learning
algorithm, then $\Omega(\tau_0 \epsilon^2)$ is a lower bound on the tolerance of the strong SQ learning algorithm
obtained by Scheme 1 boosting.

We next examine the query complexity of strong SQ learning algorithms obtained by
Scheme 1 boosting. Let $N_0 = N_0(n, size(f))$ be an upper bound on the query complexity
of a weak learner. In Equation 3.4, we note that $2(i + 1)$ queries to $STAT(f, D)$ are required to
simulate a single query to $STAT(f, D_{i+1})$. Since $k_1 = \frac{1}{2\gamma^2} \ln \frac{1}{\epsilon}$ is an upper bound on the number
of weak learners run in the boosting scheme, $O(N_0 k_1^2) = O(N_0 \frac{1}{\gamma^4} \log^2 \frac{1}{\epsilon})$ is an upper bound on
the query complexity of the strong SQ learning algorithm obtained by Scheme 1 boosting.

We finally examine the query space complexity of strong SQ learning algorithms obtained
by Scheme 1 boosting. There are two cases to consider depending on the nature of the instance
space. If the instance space is discrete, e.g. the Boolean hypercube $\{0, 1\}^n$, then the query space
and hypothesis class used by an SQ algorithm are generally finite. In this case, we can bound
the size of the query space used by the strong SQ learning algorithm obtained by boosting,
and this result is given below. If the instance space is continuous, e.g. $n$-dimensional Euclidean
space $\mathbb{R}^n$, then the query space and hypothesis class used by an SQ algorithm are generally
infinite. In this case, we can bound the VC-dimension of the query space used by the strong
SQ learning algorithm obtained by boosting, and this result is given in the appendix.

Let $Q_0$ and $\mathcal{H}_0$ be the finite query space and finite hypothesis class used by a weak SQ
learning algorithm. The queries used by the strong SQ learning algorithm obtained by Scheme 1
boosting are of the form $\chi \wedge \chi^i_j$ and $\chi^i_j$, where $\chi \in Q_0$ and $\chi^i_j$ is constructed from hypotheses in
$\mathcal{H}_0$. The queries $\chi^i_j$ are defined by $i$ hypotheses and a number $j$, $0 \le j \le i$. Since the hypotheses
need not be distinct, for fixed $i$ and $j$, the number of unique $\chi^i_j$ queries is $\binom{|\mathcal{H}_0| + i - 1}{i}$.¹ For fixed $i$,
the number of $\chi^i_j$ queries is $(i + 1) \cdot \binom{|\mathcal{H}_0| + i - 1}{i}$. Since $i$ is bounded by $k_1$, the total number of
$\chi^i_j$ queries is given by $\sum_{i=1}^{k_1} (i + 1) \cdot \binom{|\mathcal{H}_0| + i - 1}{i}$. Given that $\chi \in Q_0$, we may bound the size of
the query space used by the strong SQ learning algorithm obtained from Scheme 1 boosting as
follows:

1 This expression corresponds to the number of unique arrangements of $i$ indistinguishable balls in $|\mathcal{H}_0|$ bins.
Each unique arrangement corresponds to a unique $\chi^i_j$ in that the number of balls in bin $\ell$ corresponds to the
number of copies of the hypothesis associated with bin $\ell$ used in $\chi^i_j$.

$$|Q_B| = |Q_0| + \sum_{i=1}^{k_1} (i + 1) \cdot \binom{|\mathcal{H}_0| + i - 1}{i} + |Q_0| \sum_{i=1}^{k_1} (i + 1) \cdot \binom{|\mathcal{H}_0| + i - 1}{i}$$

In the appendix, it is shown that this expression has the following closed form:

$$|Q_B| = (|Q_0| + 1) \binom{|\mathcal{H}_0| + k_1}{k_1} + |\mathcal{H}_0| (|Q_0| + 1) \binom{|\mathcal{H}_0| + k_1}{k_1 - 1} - 1$$

Furthermore, it is shown that $|Q_B|$ is bounded above as follows:

$$|Q_B| \le 2 (|Q_0| + 1)(|\mathcal{H}_0| + 2)^{k_1}$$

The complexity of simulating such an SQ algorithm in the various PAC models will depend on
$\log |Q_B|$. We note that $\log |Q_B| = O(\log |Q_0| + k_1 \log |\mathcal{H}_0|)$.

Finally, in the appendix it is shown that the VC-dimension of the query space is bounded
as follows:

$$VC(Q_B) = O(VC(Q_0) + VC(\mathcal{H}_0) \cdot k_1 \log k_1)$$
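
The counting argument can be checked numerically for small parameters. The sketch below assumes the defining sum runs over $i = 1, \ldots, k_1$ with $(i + 1) \cdot \binom{|\mathcal{H}_0| + i - 1}{i}$ region queries per $i$, and that the closed form is $(|Q_0| + 1)\binom{|\mathcal{H}_0| + k_1}{k_1} + |\mathcal{H}_0|(|Q_0| + 1)\binom{|\mathcal{H}_0| + k_1}{k_1 - 1} - 1$; the function names and the integer arguments `q0`, `h0`, `k1` (standing for $|Q_0|$, $|\mathcal{H}_0|$, $k_1$) are hypothetical.

```python
from math import comb

def query_space_size_sum(q0, h0, k1):
    """|Q_B| from the defining sum: Q_0, the region queries, and their conjunctions."""
    regions = sum((i + 1) * comb(h0 + i - 1, i) for i in range(1, k1 + 1))
    return q0 + regions + q0 * regions

def query_space_size_closed(q0, h0, k1):
    """|Q_B| from the closed form (as read here; the proof is in the appendix)."""
    return ((q0 + 1) * comb(h0 + k1, k1)
            + h0 * (q0 + 1) * comb(h0 + k1, k1 - 1) - 1)
```

For example, with $|Q_0| = 3$, $|\mathcal{H}_0| = 2$ and $k_1 = 2$ both functions return 55.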

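Theorem 1 Given a weak SQ learning algorithm whose query complexity is upper bounded by
$N_0 = N_0(n, size(f))$, whose tolerance is lower bounded by $\tau_0 = \tau_0(n, size(f))$, whose query space
and hypothesis class are $Q_0$ and $\mathcal{H}_0$, respectively, and whose output hypothesis has error rate at
most $\frac{1}{2} - \gamma$, then a strong SQ learning algorithm can be constructed whose query complexity is
$O(N_0 \frac{1}{\gamma^4} \log^2 \frac{1}{\epsilon})$ and whose tolerance is $\Omega(\tau_0 \epsilon^2)$. The query space complexity is given by

$$\log |Q_B| = O\left(\log |Q_0| + \frac{1}{\gamma^2} \log \frac{1}{\epsilon} \log |\mathcal{H}_0|\right)$$

when $Q_0$ and $\mathcal{H}_0$ are finite, or

$$VC(Q_B) = O\left(VC(Q_0) + VC(\mathcal{H}_0) \cdot \left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon}\right) \log \left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon}\right)\right)$$

when $Q_0$ and $\mathcal{H}_0$ have finite VC-dimension.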

3.1.2 Boosting via Scheme 2 in the Statistical Query Model 

We can use Scheme 2 to boost weak SQ learning algorithms in a manner quite similar to that
described above. Since the "abort" condition of Scheme 2 introduces "fair coin" hypotheses, we
first rederive the probability that $\chi(x, f(x))$ is true with respect to $D_{i+1}$ in terms of probabilities
with respect to $D$.

When $i$ hypotheses have been generated, let $w$ be the number of weak hypotheses and
let $i - w$ be the number of "fair coin" hypotheses. The weak hypotheses $h_1, \ldots, h_w$ partition
the instance space $X$ into $w + 1$ regions corresponding to those instances which are correctly
classified by the same number of weak hypotheses. Let $X^w_r \subseteq X$ be the set of instances which
are correctly classified by exactly $r$ of the $w$ weak hypotheses. Consider the probability that an
instance $x \in X^w_r$ passes through the probabilistic filter which defines $EX(f, D_{i+1})$. If none of the
"fair coin" hypotheses agree with $f$, then this probability is $\alpha^i_r$. If $j$ of the "fair coin" hypotheses
agree with $f$, then this probability is $\alpha^i_{r+j}$. The total probability is thus $\lambda^w_r = \sum_{j=0}^{i-w} \alpha^i_{r+j} \beta^{i-w}_j$,
where $\beta^{i-w}_j = \binom{i-w}{j} / 2^{i-w}$ is the probability that exactly $j$ of the "fair coin" hypotheses agree
with $f$. The following filtered example oracle is equivalent to $EX(f, D_{i+1})$:



1. Draw a labelled example $\langle x, f(x) \rangle$ from $EX(f, D)$.
2. Compute $h_1(x), \ldots, h_w(x)$.
3. Set $r$ to be the number of hypotheses which agree with $f$ on $x$.
4. Flip a biased coin with $\Pr[HEAD] = \lambda^w_r$.
5. If HEAD, then output example $\langle x, f(x) \rangle$; otherwise go to Step 1.



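We may now derive an expression for $\Pr_{D_{i+1}}[\chi(x, f(x))]$ as before.

$$\Pr_{D_{i+1}}[\chi(x, f(x))] = \sum_{r=0}^{w} \Pr_{D_{i+1}}[\chi(x, f(x)) \mid x \in X^w_r] \cdot \Pr_{D_{i+1}}[x \in X^w_r]$$

$$= \sum_{r=0}^{w} \Pr_{D}[\chi(x, f(x)) \mid x \in X^w_r] \cdot \Pr_{D_{i+1}}[x \in X^w_r]$$

$$= \sum_{r=0}^{w} \frac{\Pr_{D}[\chi(x, f(x)) \wedge (x \in X^w_r)]}{\Pr_{D}[x \in X^w_r]} \cdot \frac{\lambda^w_r \cdot \Pr_{D}[x \in X^w_r]}{\sum_{j=0}^{w} \lambda^w_j \cdot \Pr_{D}[x \in X^w_j]}$$

$$= \frac{\sum_{r=0}^{w} \lambda^w_r \cdot \Pr_{D}[\chi(x, f(x)) \wedge (x \in X^w_r)]}{\sum_{r=0}^{w} \lambda^w_r \cdot \Pr_{D}[x \in X^w_r]} \qquad (3.5)$$

$$STAT(f, D_{i+1})[\chi] = \frac{\sum_{r=0}^{w} \lambda^w_r \cdot STAT(f, D)[\chi \wedge \chi^w_r]}{\sum_{r=0}^{w} \lambda^w_r \cdot STAT(f, D)[\chi^w_r]} \qquad (3.6)$$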

Note that the denominators of Equations 3.5 and 3.6 again correspond to the probability $t_i$.
Also note that $\sum_{r=0}^{w} \lambda^w_r = \sum_{r=0}^{w} \sum_{j=0}^{i-w} \alpha^i_{r+j} \beta^{i-w}_j \le 1$ since the unique terms of the latter sum
are all contained in the product $\left(\sum_{r=0}^{i} \alpha^i_r\right)\left(\sum_{j=0}^{i-w} \beta^{i-w}_j\right) \le 1$.

Applying Claims 1 and 2 to Equation 3.6, we again find that it is sufficient to submit queries
to $STAT(f, D)$ with additive error $t_i \cdot \tau/3$ in order to simulate a call to $STAT(f, D_{i+1})$ with
additive error $\tau$. Again, there are two problems with this observation. First, if $t_i$ is small, then
we are forced to submit queries with small additive error. Second, the value $t_i$ is unknown, and
in fact, it is the value of the denominator we are attempting to estimate. We can overcome
these difficulties by employing the "abort" condition of Freund which allows us to either lower
bound $t_i$ or use a "fair coin" in place of $h_{i+1}$.

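If $t_i < \epsilon(1-\epsilon)\gamma / \ln(1/\epsilon)$, then a "fair coin" can be used in place of $h_{i+1}$. Submit each query
to $STAT(f, D)$ with additive error $\frac{\epsilon(1-\epsilon)\gamma}{\ln(1/\epsilon)(2+3/\tau)}$. Let $\hat{t}_i$ be the estimate for $t_i$ obtained, and
note that by Claim 2, $\hat{t}_i$ is within additive error $\frac{\epsilon(1-\epsilon)\gamma}{\ln(1/\epsilon)(2+3/\tau)}$ of $t_i$. If $\hat{t}_i < \frac{\epsilon(1-\epsilon)\gamma}{\ln(1/\epsilon)} (1 - \frac{1}{2+3/\tau})$,
then $t_i < \epsilon(1-\epsilon)\gamma / \ln(1/\epsilon)$. In this case, we may use a "fair coin" in place of $h_{i+1}$ and proceed to
the next distribution. If $\hat{t}_i \ge \frac{\epsilon(1-\epsilon)\gamma}{\ln(1/\epsilon)} (1 - \frac{1}{2+3/\tau})$, then $t_i \ge \frac{\epsilon(1-\epsilon)\gamma}{\ln(1/\epsilon)} (1 - \frac{2}{2+3/\tau}) = \frac{\epsilon(1-\epsilon)\gamma}{\ln(1/\epsilon)} (\frac{3/\tau}{2+3/\tau})$. In
this case, our estimate $\hat{t}_i$ is sufficiently accurate since the additive error required by Claim 1 is
$t_i \cdot \tau/3$, and $t_i \cdot \tau/3 \ge \frac{\epsilon(1-\epsilon)\gamma}{\ln(1/\epsilon)} (\frac{3/\tau}{2+3/\tau}) \cdot \tau/3 = \frac{\epsilon(1-\epsilon)\gamma}{\ln(1/\epsilon)(2+3/\tau)}$, which is the additive error used. Given
that the numerator and denominator are both estimated with additive error at most $t_i \cdot \tau/3$, their ratio
is within additive error $\tau$ by Claim 1.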

We can now bound the tolerance of strong SQ learning algorithms obtained by Scheme 2
boosting. If $\tau_0 = \tau_0(n, size(f))$ is a lower bound on the tolerance of a weak SQ learning
algorithm, then $\Omega(\tau_0 \epsilon \gamma / \log(1/\epsilon))$ is a lower bound on the tolerance of the strong SQ learning
algorithm obtained by Scheme 2 boosting.

We next examine the query complexity of strong SQ learning algorithms obtained by
Scheme 2 boosting. Let $N_0 = N_0(n, size(f))$ be an upper bound on the query complexity
of a weak learner. In Equation 3.6, we note that $2(w + 1) \le 2(i + 1)$ queries to $STAT(f, D)$ are
required to simulate a single query to $STAT(f, D_{i+1})$. Since $k_2 = \frac{1}{\gamma^2} \ln \frac{1}{\epsilon}$ is an upper bound on
the number of weak learners run in the boosting scheme, $O(N_0 k_2^2) = O(N_0 \frac{1}{\gamma^4} \log^2 \frac{1}{\epsilon})$ is an upper
bound on the query complexity of the strong SQ learning algorithm obtained by Scheme 2
boosting.

We finally note that the query space complexity results for Scheme 2 boosting are identical
to those for Scheme 1 boosting when $k_1$ is replaced by $k_2$.

Theorem 2 Given a weak SQ learning algorithm whose query complexity is upper bounded by
$N_0 = N_0(n, size(f))$, whose tolerance is lower bounded by $\tau_0 = \tau_0(n, size(f))$, whose query space
and hypothesis class are $Q_0$ and $\mathcal{H}_0$, respectively, and whose output hypothesis has error rate at
most $\frac{1}{2} - \gamma$, then a strong SQ learning algorithm can be constructed whose query complexity is
$O(N_0 \frac{1}{\gamma^4} \log^2 \frac{1}{\epsilon})$ and whose tolerance is $\Omega(\tau_0 \epsilon \gamma / \log(1/\epsilon))$. The query space complexity is given
by

$$\log |Q_B| = O\left(\log |Q_0| + \frac{1}{\gamma^2} \log \frac{1}{\epsilon} \log |\mathcal{H}_0|\right)$$

when $Q_0$ and $\mathcal{H}_0$ are finite, or

$$VC(Q_B) = O\left(VC(Q_0) + VC(\mathcal{H}_0) \cdot \left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon}\right) \log \left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon}\right)\right)$$

when $Q_0$ and $\mathcal{H}_0$ have finite VC-dimension.

3.1.3 Hybrid Boosting in the Statistical Query Model 

We obtain a more efficient boosting scheme in the SQ model by combining the two previously
described methods. As in the PAC model, we use Scheme 1 to boost from $\frac{1}{2} - \gamma$ to $\frac{1}{4}$ and
Scheme 2 to boost from $\frac{1}{4}$ to $\epsilon$. By combining the results of Theorem 1 and Theorem 2, we
immediately obtain an upper bound on the query complexity of the hybrid boosting scheme
and a lower bound on the tolerance of the hybrid boosting scheme. An upper bound on the
query space complexity of the hybrid boosting scheme is given in the appendix. We thus obtain
the following improved boosting result.

Theorem 3 Given a weak SQ learning algorithm whose query complexity is upper bounded by
$N_0 = N_0(n, size(f))$, whose tolerance is lower bounded by $\tau_0 = \tau_0(n, size(f))$, whose query space
and hypothesis class are $Q_0$ and $\mathcal{H}_0$, respectively, and whose output hypothesis has error rate at
most $\frac{1}{2} - \gamma$, then a strong SQ learning algorithm can be constructed whose query complexity is
$O(N_0 \frac{1}{\gamma^4} \log^2 \frac{1}{\epsilon})$ and whose tolerance is $\Omega(\tau_0 \epsilon / \log(1/\epsilon))$. The query space complexity is given by

$$\log |Q_{HB}| = O\left(\log |Q_0| + \frac{1}{\gamma^2} \log \frac{1}{\epsilon} \log |\mathcal{H}_0|\right)$$

when $Q_0$ and $\mathcal{H}_0$ are finite, or

$$VC(Q_{HB}) = O\left(VC(Q_0) + VC(\mathcal{H}_0) \cdot \left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon}\right) \log \left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon}\right)\right)$$

when $Q_0$ and $\mathcal{H}_0$ have finite VC-dimension.

Note that the tolerance of the strong SQ learning algorithm constructed has no dependence
on $\gamma$ in this hybrid boosting scheme.

3.2 General Bounds on Learning in the Statistical Query Model 

In this section, we derive general upper bounds on the complexity of statistical query learning.
These results are obtained by applying the boosting results of the previous section. We further 
show that our general upper bounds are nearly optimal by demonstrating the existence of a 
function class whose minimum learning complexity nearly matches our general upper bounds. 

3.2.1 General Upper Bounds on Learning in the SQ Model 

Just as the sample complexity of boosting in the PAC model yields general upper bounds on the
sample complexity of strong PAC learning, the query, query space and tolerance complexities
of boosting in the SQ model yield general bounds on the query, query space and tolerance
complexities of strong SQ learning.

We can convert any strong SQ learning algorithm into a weak SQ learning algorithm by
"hardwiring" the accuracy parameter $\epsilon$ to a constant. We can then boost this learning
algorithm, via Scheme 2 for instance, to obtain a strong SQ learning algorithm whose dependence
on $\epsilon$ is nearly optimal.

Theorem 4 If the class $\mathcal{F}$ is strongly SQ learnable, then $\mathcal{F}$ is strongly SQ learnable by an
algorithm whose query complexity is $O(N_0 \log^2 \frac{1}{\epsilon})$, whose tolerance is $\Omega(\tau_0 \epsilon / \log(1/\epsilon))$, and whose
query space complexity is $O(p_3(n) \log \frac{1}{\epsilon})$ when the query space is finite or $O(p_4(n) \log \frac{1}{\epsilon} \log \log \frac{1}{\epsilon})$
when the query space has finite VC-dimension, where $N_0 = p_1(n, size(f))$, $\tau_0 = 1/p_2(n, size(f))$
and $p_1$, $p_2$, $p_3$ and $p_4$ are polynomials.

While we have focused primarily on the query, query space and tolerance complexities of
SQ learning, we note that our boosting results can also be applied to bound the time, space
and hypothesis size complexities of SQ learning. It is easily shown that, with respect to $\epsilon$, these
complexities are bounded by O(log -)TO(log-) and O(log -^respectively. 
For any function class of VC-dimension $d$, Kearns [19] has shown that learning in the SQ
model requires $\Omega(d / \log d)$ queries each with additive error $O(\epsilon)$. Whereas Kearns
simultaneously lower bounds the query complexity and upper bounds the tolerance, we have
simultaneously upper bounded the query complexity and lower bounded the tolerance. Note that the
tolerance we give in Theorem 4 is optimal to within a logarithmic factor. While Kearns'
general lower bound leaves open the possibility that there may exist a general upper bound on the
query complexity which is independent of $\epsilon$, we show that this is not the case by demonstrating
a specific learning problem which requires $\Omega(\frac{d}{\log d} \log \frac{1}{\epsilon})$ queries each with additive error $O(\epsilon)$ in
the SQ model. Thus, with respect to $\epsilon$, our general upper bound on query complexity is within
a $\log \frac{1}{\epsilon}$ factor of the best possible general upper bound.

3.2.2 A Specific Lower Bound for Learning in the SQ Model 

In this section, we describe a function class whose minimum learning complexity nearly matches
our general upper bounds. We begin by introducing a game on which our learning problem is 
based. 

Consider the following two player game parameterized by $t$, $d$ and $N$ where $t \le d \le N$. The
adversary chooses a set² $S \subseteq [N]$ of size $d$, and the goal of the player is to output a set $T \subseteq [N]$
such that $|S \triangle T| \le t$. The player is allowed to ask queries of the form $Q \subseteq [N]$ to which the
adversary returns $|Q \cap S|$.

Lemma 1 For any $d \ge 4$, $t \le d/4$ and $N = \Omega(d^{1+\alpha})$ for some $\alpha > 0$, the player requires
$\Omega(\frac{d}{\log d} \log N)$ queries of the oracle, in the worst case.

Proof: Any legal adversary must return responses to the given queries which are consistent
with some set $S \subseteq [N]$ of size $d$. We construct an adaptive, malicious adversary which works as
follows. Let $\mathcal{S}_0 \subseteq 2^{[N]}$ be the set of all $\binom{N}{d}$ subsets of size $d$. When the player presents
the first query $Q_1 \subseteq [N]$, the adversary calculates the value of $|S \cap Q_1|$ for every $S \in \mathcal{S}_0$ and
partitions the set $\mathcal{S}_0$ into $d + 1$ sets $\mathcal{S}_0^0, \mathcal{S}_0^1, \ldots, \mathcal{S}_0^d$ where each subset $S \in \mathcal{S}_0^i$ has $|S \cap Q_1| = i$.
For $i = \arg\max_j \{|\mathcal{S}_0^j|\}$, the adversary returns the value $i$ and lets $\mathcal{S}_1 = \mathcal{S}_0^i$. In general, $\mathcal{S}_k$ is
the set of remaining subsets which are consistent with the responses given to the first $k$ queries,
and the adversary answers each query so as to maximize the remaining number of subsets. Note
that $|\mathcal{S}_k| \ge |\mathcal{S}_0| / (d+1)^k = \binom{N}{d} / (d+1)^k$.

2 We use the standard combinatorial notation $[N] = \{1, \ldots, N\}$.
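
One round of the adaptive adversary described in the proof can be sketched as follows for small $N$ and $d$. The name `adversary_answer` is hypothetical, candidate sets are represented as `frozenset` objects, and the exhaustive enumeration is only feasible for toy parameters.

```python
from itertools import combinations

def adversary_answer(consistent, query):
    """Answer one query so that the most candidate sets remain consistent.

    `consistent` is the current family of candidate sets and `query` is the
    player's query set. Returns (reported intersection size, surviving family).
    """
    buckets = {}
    for S in consistent:
        buckets.setdefault(len(S & query), []).append(S)
    size, survivors = max(buckets.items(), key=lambda kv: len(kv[1]))
    return size, survivors
```

Since a query splits the family into at most $d + 1$ buckets, the largest bucket keeps at least a $1/(d+1)$ fraction of the candidates, which is the counting step behind the bound on the number of remaining subsets. For instance, with `family = [frozenset(c) for c in combinations(range(1, 7), 2)]` and the query `{1, 2, 3}`, the adversary answers 1 and keeps 9 of the 15 candidates.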

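For any $\mathcal{S} \subseteq 2^{[N]}$, we define $width(\mathcal{S}) = \max_{S_i, S_j \in \mathcal{S}} \{|S_i \triangle S_j|\}$. Note that if $width(\mathcal{S}_k) > 2t$,
then there exist at least two sets $S_1, S_2 \in \mathcal{S}_k$ such that $|S_1 \triangle S_2| > 2t$. This implies that there
cannot exist a set $T$ which satisfies both $|S_1 \triangle T| \le t$ and $|S_2 \triangle T| \le t$ (since $\triangle$ is a metric over
the space of sets which satisfies the triangle inequality property). If the player were to stop
and output a set $T$ at this point, then the malicious adversary could always force the player to
lose. We now bound $width(\mathcal{S}_k)$ as a function of $|\mathcal{S}_k|$. This, combined with our bound on $|\mathcal{S}_k|$ as
a function of $k$, will effectively bound the minimum number of queries required by the player.
We make use of the following inequalities:³

$$\left(\frac{n}{k}\right)^k \le \binom{n}{k} \le \left(\!\!\binom{n}{k}\!\!\right) \le \left(\frac{en}{k}\right)^k$$

For any $\mathcal{S} \subseteq 2^{[N]}$ of width at most $w$, one can easily show that $|\mathcal{S}| \le \left(\!\!\binom{N}{w}\!\!\right)$. Thus, if
$|\mathcal{S}_k| > \left(\!\!\binom{N}{2t}\!\!\right)$, then $width(\mathcal{S}_k) > 2t$. We now note that any $k$ which satisfies the following
inequality will guarantee that $width(\mathcal{S}_k) > 2t$:

$$\left(\!\!\binom{N}{2t}\!\!\right) \le \left(\!\!\binom{N}{d/2}\!\!\right) \le \left(\frac{2eN}{d}\right)^{d/2} < \frac{(N/d)^d}{(d+1)^k} \le \frac{\binom{N}{d}}{(d+1)^k} \le |\mathcal{S}_k|$$

Solving the third inequality for $(d+1)^k$, we obtain:

$$(d+1)^k < \frac{(N/d)^d}{(2eN/d)^{d/2}} = \left(\frac{N}{2ed}\right)^{d/2}$$

Thus, a lower bound on the number of queries required by the player is

$$\Omega\left(\frac{d \log (N/2ed)}{2 \log(d+1)}\right) = \Omega\left(\frac{d}{\log d} \log N\right)$$

for $N = \Omega(d^{1+\alpha})$. □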

Now consider a learning problem defined as follows. Our instance space $X$ is the set of
natural numbers $\mathbb{N}$, and our function class is the set of all indicator functions corresponding to
subsets of $\mathbb{N}$ of size $d$. This function class is easily learnable in the SQ model. In what follows,
we show that any deterministic SQ algorithm for this class requires $\Omega(\frac{d}{\log d} \log \frac{1}{\epsilon})$ queries with
additive error $O(\epsilon)$.

3 We use the standard combinatorial notation $\left(\!\!\binom{n}{k}\!\!\right) = \sum_{i=0}^{k} \binom{n}{i}$.

Theorem 5 There exists a parameterized family of function classes which requires $\Omega(\frac{d}{\log d} \log \frac{1}{\epsilon})$
queries with additive error $O(\epsilon)$ to learn in the SQ model.

Proof: Consider the two-player game as defined above. For an instance of the game specified 
by tTd and N (where d > 4Tt = d/4 and N = £l(d 1+a ))Twe create an instance of the learning 
problem as follows. We define our distribution D over Af to place weight 4/Nd on each point 
1, . . . , N and to place weight 1 o4/<i on the point JV + 1. Ah other points have zero weight. We 
set e = 1/iV and call the deterministic SQ learning algorithm. Note that the target subset has 
weight 4/iVrso if the SQ algorithm submits a query with additive error greater than 4e = 4/N 
we may answer the query ourselves (as if the target subset were "empty"). For any query x 
submitted with tolerance less than 4eFwe determine the exact answer as follows. Begin with 
an answer of 0. If x(N +1,0)= lTthen add 1 o4/<i to the answer. Determine the following 
three subsets of [JV]: X°TXl and X 2 where x G X° if x(z,0) = 1 and x(%, 1) = OFa; G X{ if 
X(x, 0) = and x(%, 1) = irand x G X 2 if x( x , 0) = 1 and x(%, 1) = 1- Add \X 2 \ ■ 4/Nd to the 
answer. Submit the query X° to the adversaryFand for a response r add (\X°\ Or) • 4/Nd to 
the answer. Submit the query X{ to the adversaryFand for a response r add r ■ 4/Nd to the 
answer. Return the final value of the answer to the SQ algorithm. 

Note that we are able to answer each SQ algorithm query by submitting only two queries
to the adversary, and we need not submit any queries to the adversary if the requested additive
error is greater than $4\epsilon$. Since $\Omega(\frac{d}{\log d} \log N)$ queries of the adversary are required, the SQ
algorithm must ask $\Omega(\frac{d}{\log d} \log N) = \Omega(\frac{d}{\log d} \log \frac{1}{\epsilon})$ queries with additive error $O(\epsilon)$. □

Using techniques similar to those found in Kearns' lower bound proof [19], the above proof
can easily be modified to show that even if the adversary chooses his subset randomly and
uniformly before the game starts, there exists some constant probability with which any
SQ algorithm (deterministic or probabilistic) will fail if it asks $o(\frac{d}{\log d} \log \frac{1}{\epsilon})$ queries with
additive error $O(\epsilon)$. This result is given in the appendix.
3.3 Simulating SQ Algorithms in the Classification Noise Model

In this section, we describe an improved method for efficiently simulating a statistical query
algorithm in the classification noise model. The advantages of this new method are twofold.
First, our simulation employs a new technique which significantly reduces the running time of
simulating SQ algorithms. Second, our formulation for estimating individual queries is simpler
and more easily generalized.

Kearns' procedure for simulating SQ algorithms works in the following way. Kearns shows
that given a query $\chi$, $P_\chi$ can be written as an expression involving the unknown noise rate $\eta$
and other probabilities which can be estimated from the noisy example oracle $EX^\eta_{CN}(f, D)$. We
note that the derivation of this expression relies on $\chi$ being $\{0,1\}$-valued. The actual expression
obtained is given below.

$$P_\chi = \frac{1}{1-2\eta} P^\eta_\chi + \left(1 - \frac{1}{1-2\eta}\right) p_2 P^2_\chi - \frac{\eta}{1-2\eta}\, p_1 \tag{3.7}$$

In order to estimate $P_\chi$ with additive error $\tau$, a sensitivity analysis is employed to determine
how accurately each of the components on the right-hand side of Equation 3.7 must be known.
Kearns shows that for some constants $c_1$ and $c_2$, if $\eta$ is estimated with additive error
$c_1 \tau (1-2\eta)^2$ and each of the probabilities is estimated with additive error $c_2 \tau (1-2\eta_b)$, then
the estimate obtained for $P_\chi$ from Equation 3.7 will be sufficiently accurate. Since the value of
$\eta$ is not known, the procedure for simulating SQ algorithms essentially guesses a set of values
for $\eta$, $\{\eta_0, \eta_1, \ldots, \eta_i\}$, such that at least one $\eta_j$ satisfies $|\eta_j - \eta| \le c_1 \tau_* (1-2\eta)^2$, where
$\tau_*$ is a lower bound on the tolerance of the SQ algorithm. Since
$c_1 \tau_* (1-2\eta_b)^2 \le c_1 \tau_* (1-2\eta)^2$, the simulation uniformly guesses
$\Theta(\frac{1}{\tau_*(1-2\eta_b)^2})$ values of $\eta$ between 0 and $\eta_b$. For each guess of $\eta$, the simulation
runs a separate copy of the SQ algorithm and estimates the various queries using the formula
given above. Since some guess at $\eta$ was good, at least one of the runs will have produced a
good hypothesis with high probability. The various hypotheses are then tested to find a good
hypothesis, of which at least one exists. Note that the $\eta$-guessing has a significant impact on
the running time of the simulation.

In what follows, we show a new derivation of $P_\chi$ which is simpler and more easily
generalizable than Kearns' original version. We also show that to estimate an individual $P_\chi$, it is
only necessary to have an estimate of $\eta$ within additive error $c\tau(1-2\eta)$ for some constant $c$.
We further show that the number of $\eta$-guesses need only be $O(\frac{1}{\tau_*} \log \frac{1}{1-2\eta_b})$, thus significantly
reducing the time complexity of the SQ simulation.

3.3.1 A New Derivation for $P_\chi$

In this section, we present a simpler derivation of an expression for $P_\chi$. In previous sections,
it was convenient to view a $\{0,1\}$-valued $\chi$ as a predicate so that $P_\chi = \Pr_D[\chi(x, f(x))]$. In
this section, it will be more convenient to view $\chi$ as a function so that $P_\chi = E_D[\chi(x, f(x))]$.
Further, by making no assumptions on the range of $\chi$, the results obtained herein can easily be
generalized; these generalizations are discussed in Chapter 5.

Let $X$ be the instance space, and let $Y = X \times \{0,1\}$ be the labelled example space. We
consider a number of different example oracles and the distributions these example oracles
impose on the space of labelled examples. For a given target function $f$ and distribution $D$
over $X$, let $EX(f, D)$ be the standard, noise-free example oracle. In addition, we define the
following example oracles: Let $EX(\bar{f}, D)$ be the anti-example oracle, $EX^\eta_{CN}(f, D)$ be the noisy
example oracle and $EX^\eta_{CN}(\bar{f}, D)$ be the noisy anti-example oracle. Note that we have access
to $EX^\eta_{CN}(f, D)$ and we can easily construct $EX^\eta_{CN}(\bar{f}, D)$ by simply flipping the label of each
example drawn from $EX^\eta_{CN}(f, D)$.

Each of these oracles imposes a distribution over labelled examples. Let $D_f$, $D_{\bar{f}}$, $D^\eta_f$ and
$D^\eta_{\bar{f}}$ be these distributions, respectively. Note that $P_\chi = E_D[\chi(x, f(x))] = E_{D_f}[\chi]$.

Finally, for a labelled example $y = (x, l)$, let $\bar{y} = (x, \bar{l})$. We define $\bar{\chi}(y) = \chi(\bar{y})$. Note that
$\bar{\chi}$ is a new function which, on input $(x, l)$, simply outputs $\chi(x, \bar{l})$. The function $\bar{\chi}$ is easily
constructed from $\chi$.

Theorem 6

$$P_\chi = E_{D_f}[\chi] = \frac{(1-\eta)\, E_{D^\eta_f}[\chi] - \eta\, E_{D^\eta_f}[\bar{\chi}]}{1-2\eta} \tag{3.8}$$

Proof: We begin by relating the various example oracles defined above. Recall that the noisy
example oracle $EX^\eta_{CN}(f, D)$ is defined as follows: Draw an instance $x \in X$ according to $D$
and output $(x, f(x))$ with probability $1-\eta$ or $(x, \neg f(x))$ with probability $\eta$. The draw of $x$
is performed randomly and independently for each call to $EX^\eta_{CN}(f, D)$, and the correct or
incorrect labelling of $x$ is performed randomly and independently for each call to $EX^\eta_{CN}(f, D)$.
In particular, the correct or incorrect labelling of $x$ is not dependent on the instance $x$ itself.

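Given the independence described above, we may equivalently define $EX^\eta_{CN}(f, D)$ (and
$EX^\eta_{CN}(\bar{f}, D)$) as follows:

$$EX^\eta_{CN}(f, D) = \begin{cases} EX(f, D) & \text{with probability } 1-\eta \\ EX(\bar{f}, D) & \text{with probability } \eta \end{cases}$$

$$EX^\eta_{CN}(\bar{f}, D) = \begin{cases} EX(\bar{f}, D) & \text{with probability } 1-\eta \\ EX(f, D) & \text{with probability } \eta \end{cases}$$

We may use these equivalent definitions to deduce the following:

$$E_{D^\eta_f}[\chi] = (1-\eta)\, E_{D_f}[\chi] + \eta\, E_{D_{\bar{f}}}[\chi] \tag{3.9}$$

$$E_{D^\eta_{\bar{f}}}[\chi] = (1-\eta)\, E_{D_{\bar{f}}}[\chi] + \eta\, E_{D_f}[\chi] \tag{3.10}$$

Multiplying Equation 3.9 by $(1-\eta)$ and Equation 3.10 by $\eta$, we obtain:

$$(1-\eta)\, E_{D^\eta_f}[\chi] = (1-\eta)^2 E_{D_f}[\chi] + \eta(1-\eta)\, E_{D_{\bar{f}}}[\chi] \tag{3.11}$$

$$\eta\, E_{D^\eta_{\bar{f}}}[\chi] = \eta(1-\eta)\, E_{D_{\bar{f}}}[\chi] + \eta^2 E_{D_f}[\chi] \tag{3.12}$$

Subtracting Equation 3.12 from Equation 3.11 and solving for $E_{D_f}[\chi]$, we finally obtain:

$$E_{D_f}[\chi] = \frac{(1-\eta)\, E_{D^\eta_f}[\chi] - \eta\, E_{D^\eta_{\bar{f}}}[\chi]}{1-2\eta}$$

To obtain Equation 3.8, we simply note that $E_{D^\eta_{\bar{f}}}[\chi] = E_{D^\eta_f}[\bar{\chi}]$. □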



Note that in the derivation given above, we have not assumed that $\chi$ is $\{0,1\}$-valued. This
derivation is quite general and can be applied to estimating the expectations of real-valued
queries. This result is given in Chapter 5.

Finally, note that if we define

$$\chi^\eta(y) = \frac{(1-\eta)\,\chi(y) - \eta\,\bar{\chi}(y)}{1-2\eta}$$

then $P_\chi = E_{D_f}[\chi] = E_{D^\eta_f}[\chi^\eta]$. Thus, given a $\chi$ whose expectation we require with respect to
the noise-free oracle, we can construct a new $\chi^\eta$ whose expectation with respect to the noisy
oracle is identical to the answer we require. This formulation may even be more convenient if
one has the capability of estimating the expectation of real-valued functions; we discuss this
generalization in Chapter 5.
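
The identity of Theorem 6 can be checked exactly on a small finite distribution. The sketch below is illustrative only: the function names and the representation of $D$ as a dictionary of point weights are assumptions made here, and expectations are computed in closed form rather than by sampling.

```python
from fractions import Fraction

def noisy_expectation(chi, D, f, eta):
    """E[chi] under the noisy distribution: label f(x) w.p. 1-eta, flipped w.p. eta."""
    return sum(p * ((1 - eta) * chi(x, f(x)) + eta * chi(x, 1 - f(x)))
               for x, p in D.items())

def corrected_expectation(chi, D, f, eta):
    """Right-hand side of Equation 3.8, built from two noisy expectations."""
    chi_bar = lambda x, l: chi(x, 1 - l)          # chi applied to the flipped label
    numerator = ((1 - eta) * noisy_expectation(chi, D, f, eta)
                 - eta * noisy_expectation(chi_bar, D, f, eta))
    return numerator / (1 - 2 * eta)

def chi_eta(chi, eta):
    """The single query chi^eta whose noisy expectation equals P_chi."""
    return lambda x, l: ((1 - eta) * chi(x, l) - eta * chi(x, 1 - l)) / (1 - 2 * eta)
```

Nothing in the code requires `chi` to be $\{0,1\}$-valued, which mirrors the remark that the derivation extends to real-valued queries.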

3.3.2 Sensitivity Analysis

In this section, we provide a sensitivity analysis of Equation 3.8 in order to determine the
accuracy with which various quantities must be estimated. We make use of the following claim.

Claim 3 If $0 \le a, b, c, \tau \le 1$ and $\{a = b/c,\ a = b \cdot c,\ a = b - c\}$, then to obtain an estimate of
$a$ within additive error $\tau$, it is sufficient to obtain estimates of $b$ and $c$ within additive error
$\{c\tau/3,\ \tau(\sqrt{2}-1),\ \tau/2\}$, respectively.

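Proof: The $a = b/c$ case is proven in Claim 1. The $a = b \cdot c$ case is proven as follows.

$$\begin{aligned}
(b + \tau(\sqrt{2}-1)) \cdot (c + \tau(\sqrt{2}-1)) &= b \cdot c + b\tau(\sqrt{2}-1) + c\tau(\sqrt{2}-1) + \tau^2(\sqrt{2}-1)^2 \\
&= a + b\tau(\sqrt{2}-1) + c\tau(\sqrt{2}-1) + \tau^2(3-2\sqrt{2}) \\
&\le a + \tau(\sqrt{2}-1) + \tau(\sqrt{2}-1) + \tau(3-2\sqrt{2}) \\
&= a + \tau
\end{aligned}$$

$$\begin{aligned}
(b - \tau(\sqrt{2}-1)) \cdot (c - \tau(\sqrt{2}-1)) &= b \cdot c - b\tau(\sqrt{2}-1) - c\tau(\sqrt{2}-1) + \tau^2(\sqrt{2}-1)^2 \\
&= a - b\tau(\sqrt{2}-1) - c\tau(\sqrt{2}-1) + \tau^2(3-2\sqrt{2}) \\
&\ge a - \tau(\sqrt{2}-1) - \tau(\sqrt{2}-1) \\
&\ge a - \tau
\end{aligned}$$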

The $a = b - c$ case is trivial. □

Lemma 2 Let $\hat{\eta}$, $\hat{E}_{D^\eta_f}[\chi]$ and $\hat{E}_{D^\eta_f}[\bar{\chi}]$ be estimates of $\eta$, $E_{D^\eta_f}[\chi]$ and $E_{D^\eta_f}[\bar{\chi}]$, each within additive
error $\tau(1-2\eta)(\sqrt{2}-1)/6$. Then the quantity

$$\frac{(1-\hat{\eta})\, \hat{E}_{D^\eta_f}[\chi] - \hat{\eta}\, \hat{E}_{D^\eta_f}[\bar{\chi}]}{1-2\hat{\eta}}$$

is within additive error $\tau$ of $P_\chi = E_{D_f}[\chi]$.

Proof: To obtain an estimate of the right-hand side of Equation 3.8 within additive error $\tau$,
it is sufficient to obtain estimates of the numerator and denominator within additive error
$(1-2\eta)\tau/3$. This condition holds for the denominator if $\eta$ is estimated with additive error
$(1-2\eta)\tau/6$.

To obtain an estimate of the numerator within additive error $(1-2\eta)\tau/3$, it is sufficient to
estimate the summands of the numerator with additive error $(1-2\eta)\tau/6$. Similarly, to obtain
accurate estimates of these summands, it is sufficient to estimate $\eta$, $E_{D^\eta_f}[\chi]$ and $E_{D^\eta_f}[\bar{\chi}]$ each
with additive error $(1-2\eta)\tau(\sqrt{2}-1)/6$. □

Estimates for $E_{D^\eta_f}[\chi]$ and $E_{D^\eta_f}[\bar{\chi}]$ are obtained by sampling, and an "estimate" for $\eta$ is
obtained by guessing. We address these issues in the following sections.
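
A quick numerical sanity check of Lemma 2 is sketched below. It perturbs the three inputs by the full allowed error in every sign pattern and reports the largest resulting deviation from $P_\chi$. The function names are hypothetical, and checking only the corner perturbations is a simplification made here; it illustrates the lemma rather than proving it.

```python
import itertools
import math

def lemma2_estimate(eta_hat, e_chi_hat, e_chibar_hat):
    """The quantity formed in Lemma 2 from the three estimates."""
    return ((1 - eta_hat) * e_chi_hat - eta_hat * e_chibar_hat) / (1 - 2 * eta_hat)

def worst_corner_error(A, B, eta, tau):
    """Largest |estimate - A| over the 8 sign patterns of maximal perturbation.

    A is the noise-free expectation of chi, B the noise-free expectation of
    chi under flipped labels (both are inputs chosen for the experiment).
    """
    err = tau * (1 - 2 * eta) * (math.sqrt(2) - 1) / 6
    e_chi = (1 - eta) * A + eta * B          # noisy expectation of chi
    e_chibar = (1 - eta) * B + eta * A       # noisy expectation of chi-bar
    worst = 0.0
    for s in itertools.product((-1, 1), repeat=3):
        est = lemma2_estimate(eta + s[0] * err, e_chi + s[1] * err, e_chibar + s[2] * err)
        worst = max(worst, abs(est - A))
    return worst
```

For instance, `worst_corner_error(0.7, 0.2, 0.4, 0.1)` stays well below the target error of 0.1.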

3.3.3 Estimating $E_{D^\eta_f}[\chi]$ and $E_{D^\eta_f}[\bar{\chi}]$

One can estimate the expected values of all queries submitted by drawing separate samples
for each of the corresponding $\chi$'s and $\bar{\chi}$'s and applying Lemma 2. However, better results are
obtained by appealing to uniform convergence.

Let $Q$ be the query space of the SQ algorithm and let $\bar{Q} = \{\bar{\chi} : \chi \in Q\}$. The query space of
our simulation is $Q' = Q \cup \bar{Q}$. Note that for finite $Q$, $|Q'| \le 2|Q|$. One can further show that
for all $Q$, $VC(Q') \le c \cdot VC(Q)$ for a constant $c \approx 4.66$. This result is given in the appendix.

If $\tau_*$ is a lower bound on the minimum additive error requested by the SQ algorithm and
$\eta_b$ is an upper bound on the noise rate, then by Lemma 2, $(1-2\eta_b)\tau_*(\sqrt{2}-1)/6$ is a sufficient
additive error with which to estimate all expectations. Standard uniform convergence results
can be applied to show that all expectations can be estimated within the given additive error
using a single noisy sample of size

$$m_1 = O\left(\frac{1}{\tau_*^2 (1-2\eta_b)^2} \log \frac{|Q|}{\delta}\right)$$

in the case of a finite query space, or a single noisy sample of size

$$m_1 = O\left(\frac{VC(Q)}{\tau_*^2 (1-2\eta_b)^2} \log \frac{1}{\tau_*(1-2\eta_b)} + \frac{1}{\tau_*^2 (1-2\eta_b)^2} \log \frac{1}{\delta}\right)$$

in the case of an infinite query space of finite VC-dimension.

3.3.4 Guessing the Noise Rate $\eta$

By Lemma 2, to obtain an estimate for $P_\chi$, it is sufficient to have an estimate of the noise rate
$\eta$ within additive error $(1-2\eta)\tau_*(\sqrt{2}-1)/6$. Since the noise rate is unknown, the simulation
guesses various values of the noise rate and runs the SQ algorithm for each guess. If one of the
noise rate guesses is sufficiently accurate, then the corresponding run of the SQ algorithm will
produce the desired accurate hypothesis.

To guarantee that an accurate $\eta$-guess is used, one could simply guess $\Theta(\frac{1}{\tau_*(1-2\eta_b)})$ values
of $\eta$ spaced uniformly between 0 and $\eta_b$. This is essentially the approach adopted by Kearns.
Note that this would cause the simulation to run the SQ algorithm $\Theta(\frac{1}{\tau_*(1-2\eta_b)})$ times.

We now show that this "branching factor" can be reduced to $O(\frac{1}{\tau_*} \log \frac{1}{1-2\eta_b})$ by constructing
our $\eta$-guesses in a much better way. The result follows immediately from the following lemma
when $\gamma = \tau_*(\sqrt{2}-1)/6$.

Lemma 3 For all $\gamma, \eta_b < 1/2$, there exists a sequence of $\eta$-guesses $\{\eta_0, \eta_1, \ldots, \eta_i\}$ where
$i = O(\frac{1}{\gamma} \log \frac{1}{1-2\eta_b})$ such that for all $\eta \in [0, \eta_b]$, there exists an $\eta_j$ which satisfies $|\eta - \eta_j| \le \gamma(1-2\eta)$.

Proof: The sequence is constructed as follows. Let $\eta_0 = 0$ and consider how to determine $\eta_j$
from $\eta_{j-1}$. The value $\eta_{j-1}$ is a valid estimate for all $\eta \ge \eta_{j-1}$ which satisfy
$\eta - \gamma(1-2\eta) \le \eta_{j-1}$. Solving for $\eta$, we find that $\eta_{j-1}$ is a valid estimate for all
$\eta \in [\eta_{j-1}, \frac{\eta_{j-1}+\gamma}{1+2\gamma}]$. Consider an $\eta_j > \frac{\eta_{j-1}+\gamma}{1+2\gamma}$. The value $\eta_j$ is a valid estimate for all
$\eta \le \eta_j$ which satisfy $\eta + \gamma(1-2\eta) \ge \eta_j$. Solving for $\eta$, we find that $\eta_j$ is a valid estimate
for all $\eta \in [\frac{\eta_j-\gamma}{1-2\gamma}, \eta_j]$. To ensure that either $\eta_{j-1}$ or $\eta_j$ is a valid estimate for any
$\eta \in [\eta_{j-1}, \eta_j]$, we set

$$\frac{\eta_{j-1}+\gamma}{1+2\gamma} = \frac{\eta_j-\gamma}{1-2\gamma}.$$

Solving for $\eta_j$ in terms of $\eta_{j-1}$, we obtain

$$\eta_j = \frac{1-2\gamma}{1+2\gamma}\, \eta_{j-1} + \frac{2\gamma}{1+2\gamma}.$$

Substituting $\gamma' = 2\gamma/(1+2\gamma)$, we obtain the following recurrence:

$$\eta_j = (1-2\gamma') \cdot \eta_{j-1} + \gamma'$$

Note that if $\gamma < 1/2$, then $\gamma' < 1/2$ as well.

By constructing $\eta$-guesses using this recurrence, we ensure that for all $\eta \in [0, \eta_i]$, at least
one of $\{\eta_0, \ldots, \eta_i\}$ is a valid estimate. Solving this recurrence, we find that

$$\eta_i = \gamma' \sum_{k=0}^{i-1} (1-2\gamma')^k + \eta_0 (1-2\gamma')^i.$$

Since $\eta_0 = 0$ and we are only concerned with $\eta \le \eta_b$, we may bound the number of guesses
required by finding the smallest $i$ which satisfies

$$\gamma' \sum_{k=0}^{i-1} (1-2\gamma')^k \ge \eta_b.$$

Given that

$$\gamma' \sum_{k=0}^{i-1} (1-2\gamma')^k = \gamma' \cdot \frac{1-(1-2\gamma')^i}{1-(1-2\gamma')} = \frac{1-(1-2\gamma')^i}{2},$$

we need $(1-2\gamma')^i \le 1-2\eta_b$. Solving for $i$, we find that any $i \ge \ln\frac{1}{1-2\eta_b} / \ln\frac{1}{1-2\gamma'}$ is sufficient.
Using the fact that $1/x > 1/\ln\frac{1}{1-x}$ for all $x \in (0,1)$, we find that

$$i = \frac{1}{2\gamma'} \cdot \ln\frac{1}{1-2\eta_b} = \frac{1+2\gamma}{4\gamma} \cdot \ln\frac{1}{1-2\eta_b}$$

is an upper bound on the number of guesses required. □
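
The construction in the proof of Lemma 3 translates directly into code. The following sketch (function names are hypothetical) generates the guesses from the recurrence $\eta_j = (1-2\gamma')\eta_{j-1} + \gamma'$ and evaluates the closed-form bound on their number, so that coverage of $[0, \eta_b]$ can be checked numerically.

```python
import math

def eta_guesses(gamma, eta_b):
    """Guesses eta_0 = 0, eta_j = (1 - 2*g')*eta_{j-1} + g', stopping once eta_b is reached."""
    if not (0 < gamma < 0.5 and 0 <= eta_b < 0.5):
        raise ValueError("requires 0 < gamma < 1/2 and 0 <= eta_b < 1/2")
    gamma_p = 2 * gamma / (1 + 2 * gamma)      # gamma' from the proof
    guesses = [0.0]
    while guesses[-1] < eta_b:                 # the sequence increases towards 1/2
        guesses.append((1 - 2 * gamma_p) * guesses[-1] + gamma_p)
    return guesses

def guess_bound(gamma, eta_b):
    """Bound from the proof: ((1 + 2*gamma) / (4*gamma)) * ln(1 / (1 - 2*eta_b))."""
    return (1 + 2 * gamma) / (4 * gamma) * math.log(1 / (1 - 2 * eta_b))
```

With `gamma = 0.01` and `eta_b = 0.4`, a few dozen guesses suffice, whereas a uniform grid with spacing on the order of $\gamma(1-2\eta_b)$ would need roughly two hundred.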
3.3.5 The Overall Simulation

We now combine the results of the previous sections to obtain an overall simulation as follows:

1. Draw $m_1$ labelled examples from $EX^\eta_{CN}(f, D)$ in order to estimate the expectations in
Step 2.

2. Run the SQ algorithm once for each of the $O(\frac{1}{\tau_*} \log \frac{1}{1-2\eta_b})$ $\eta$-guesses, estimating the various
queries by applying Lemma 2 and using the sample drawn.

3. Draw $m_2$ samples and test the $O(\frac{1}{\tau_*} \log \frac{1}{1-2\eta_b})$ hypotheses obtained in Step 2. Output one
of these hypotheses whose error rate is at most $\epsilon$.

Step 3 can be accomplished by a generalization of a technique due to Laird [21]. The sample
size required is

$$m_2 = O\left(\frac{1}{\epsilon (1-2\eta_b)^2} \log\left(\frac{1}{\delta \tau_*} \log \frac{1}{1-2\eta_b}\right)\right).$$
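
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation as specified in the text: the interface `sq_algorithm(oracle)` (an SQ algorithm that calls `oracle(chi)` and returns a hypothesis) is hypothetical, samples are lists of `(x, label)` pairs with labels in {0, 1}, and Step 3 simply selects the hypothesis with the fewest disagreements on the second noisy sample as a stand-in for the generalization of Laird's technique cited above.

```python
def simulate_with_noise(sq_algorithm, sample1, sample2, eta_guesses):
    """Run the SQ algorithm once per noise-rate guess and return the best hypothesis."""

    def make_oracle(eta_hat):
        # Step 2: answer each query with the formula of Lemma 2, using sample1.
        def oracle(chi):
            e_chi = sum(chi(x, l) for x, l in sample1) / len(sample1)
            e_chibar = sum(chi(x, 1 - l) for x, l in sample1) / len(sample1)
            return ((1 - eta_hat) * e_chi - eta_hat * e_chibar) / (1 - 2 * eta_hat)
        return oracle

    hypotheses = [sq_algorithm(make_oracle(g)) for g in eta_guesses]

    # Step 3 (assumed selection rule): minimize disagreement with noisy labels.
    def disagreement(h):
        return sum(h(x) != l for x, l in sample2) / len(sample2)

    return min(hypotheses, key=disagreement)
```

Because every run reuses the same `sample1`, the cost of the simulation is dominated by the number of guesses times the cost of estimating the queries of one run.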

Since $1/\tau_* = \Omega(1/\epsilon)$ for all SQ algorithms [19], we obtain the following theorem on the total
sample complexity of this simulation.

Theorem 7 If $\mathcal{F}$ is learnable by a statistical query algorithm which makes queries from query
space $Q$ with worst case additive error $\tau_*$, then $\mathcal{F}$ is PAC learnable in the presence of
classification noise. If $\eta_b < 1/2$ is an upper bound on the noise rate, then the sample complexity
required is

$$O\left(\frac{1}{\tau_*^2 (1-2\eta_b)^2} \log \frac{|Q|}{\delta} + \frac{1}{\epsilon (1-2\eta_b)^2} \log\log \frac{1}{1-2\eta_b}\right)$$

when $Q$ is finite or

$$O\left(\frac{VC(Q)}{\tau_*^2 (1-2\eta_b)^2} \log \frac{1}{\tau_*(1-2\eta_b)} + \frac{1}{\tau_*^2 (1-2\eta_b)^2} \log \frac{1}{\delta}\right)$$

when $Q$ has finite VC-dimension.

By combining our results on general bounds for SQ learning and classification noise
simulation, we immediately obtain the following corollary.

Corollary 1 If $\mathcal{F}$ is SQ learnable, then $\mathcal{F}$ is PAC learnable in the presence of classification
noise. The dependence on $\epsilon$ and $\eta_b$ of the required sample complexity is $\tilde{O}(\frac{1}{\epsilon^2 (1-2\eta_b)^2})$.
To determine the running time of our simulation, one must distinguish between two different
types of SQ algorithms. Some SQ algorithms submit a fixed set of queries independent of
the estimates they receive for previous queries. We refer to these algorithms as "batch" SQ
algorithms. Other SQ algorithms submit various queries based upon the estimates they receive
for previous queries. We refer to these algorithms as "dynamic" SQ algorithms.⁴ Note that
multiple runs of a dynamic SQ algorithm may produce many more queries which need to
be estimated. Since the vast majority of the time required to simulate most SQ algorithms
is spent estimating queries using a large sample, the time complexity of simulating dynamic
SQ algorithms is greatly affected by the "branching factor" of the simulation. By reducing
the "branching factor" of the simulation from $\Theta(\frac{1}{\tau_*(1-2\eta_b)^2})$ to $\Theta(\frac{1}{\tau_*} \log \frac{1}{1-2\eta_b})$, the asymptotic
running time of our simulation is greatly improved.

With respect to $\eta_b$, the running time of our simulation is $\tilde{O}(\frac{1}{(1-2\eta_b)^2})$. Simon [30] has shown
a sample and time complexity lower bound of $\Omega(\frac{1}{(1-2\eta_b)^2})$ for PAC learning in the presence of
classification noise. We therefore note that the running time of our simulation is optimal with
respect to the noise rate (modulo lower order logarithmic factors). For dynamic algorithms,
the time complexity of our new simulation is in fact a $\tilde{\Theta}(\frac{1}{(1-2\eta_b)^2})$ factor better than the current
simulation.



⁴ Note that we consider any SQ algorithm which uses a polynomially sized query space to be a "batch"
algorithm since all queries may be processed in advance.



Chapter 4

Learning Results in the Relative Error SQ Model

In this chapter, we propose a new model of statistical query learning based on relative error.
We show that learnability in this new model is polynomially equivalent to learnability in the
standard, additive error model; however, this new model is advantageous in that SQ algorithms
specified in this model can be simulated more efficiently in some important cases.

4.1 Introduction 

In the standard model of statistical query learning, a learning algorithm asks for an estimate
of the probability that a predicate $\chi$ is true. The required accuracy of this estimate is specified
by the learner in the form of an additive error parameter. The limitation of this model is
clearly evident in even the standard, noise-free statistical query simulation [19]. This
simulation uses $\Omega(1/\tau_*^2)$ examples. Since $1/\tau_* = \Omega(1/\epsilon)$ for all SQ algorithms [19], this simulation
effectively uses $\Omega(1/\epsilon^2)$ examples. However, the $\epsilon$-dependence of the general bound on the
sample complexity of PAC learning is $\tilde{O}(1/\epsilon)$ [7, 11].

This $\Omega(1/\tau_*^2) = \Omega(1/\epsilon^2)$ sample complexity results from the worst case assumption that
large probabilities may need to be estimated with small additive error in the SQ model. Either
the nature of statistical query learning is such that learning sometimes requires the estimation of
large probabilities with small additive error, or it is always sufficient to estimate each probability
with an additive error comparable to the probability. If the former were the case, then the
present model and simulations would be the best that one could hope for. We show that the
latter is true, and that a model in which queries are specified with relative error is a more
natural and strictly more powerful tool.

We define such a model of relative error statistical query learning and we show how this
new model relates to the standard additive error model. We also show general upper bounds
on learning in this new model which demonstrate that for all classes learnable by statistical
queries, it is sufficient to make estimates with relative error independent of $\epsilon$. We then give
roughly optimal PAC simulations for relative error SQ algorithms. Finally, we demonstrate
natural problems which only require estimates with constant relative error.

4.2 The Relative Error Statistical Query Model 

Given the motivation above, we modify the standard model of statistical query learning to allow
for estimates being requested with relative error. We replace the additive error STAT(f, D)
oracle with a relative error Rel-STAT(f, D) oracle which accepts a query $\chi$, a relative error
parameter $\mu$, and a threshold parameter $\theta$. The value $P_\chi = \Pr_D[\chi(x, f(x))]$ is defined as before.
If $P_\chi$ is less than the threshold $\theta$, then the oracle may return the symbol $\perp$. If the oracle does
not return $\perp$, then it must return an estimate $\hat{P}_\chi$ such that

$$P_\chi (1-\mu) \le \hat{P}_\chi \le P_\chi (1+\mu).$$

Note that the oracle may choose to return an accurate estimate even if $P_\chi < \theta$. A class is said
to be learnable by relative error statistical queries if it satisfies the same conditions of additive
error statistical query learning except we instead require that $1/\mu$ and $1/\theta$ are polynomially
bounded. Let $\mu_*$ and $\theta_*$ be the lower bounds on the relative error and threshold of every query
made by an SQ algorithm. Given this definition of relative error statistical query learning, we
show the following desirable equivalence.

Theorem 8 $\mathcal{F}$ is learnable by additive error statistical queries if and only if $\mathcal{F}$ is learnable by
relative error statistical queries.
Proof: One can take any query $\chi$ to the additive error oracle which requires additive error $\tau$
and simulate it by calling the relative error oracle with relative error $\tau$ and threshold $\tau$. If
$\hat{P}_\chi = \perp$, then return 0; else, return $\hat{P}_\chi$.

Similarly, one can take any query to the relative error oracle which requires relative error $\mu$
and threshold $\theta$ and simulate it by calling the additive error oracle with additive error $\mu\theta/3$. If
$\hat{P}_\chi < \theta(1-\mu/3)$, then return $\perp$; else, return $\hat{P}_\chi$.

In each direction, the simulation uses polynomially bounded parameters if and only if the
original algorithm uses polynomially bounded parameters. □
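
Both directions of the proof of Theorem 8 amount to thin wrappers around an oracle. The sketch below is illustrative: the oracle call signatures `rel_oracle(chi, mu, theta)` and `add_oracle(chi, tau)` are assumptions made here, and `None` stands in for the symbol $\perp$.

```python
BOTTOM = None  # stands for the symbol ⊥

def additive_from_relative(rel_oracle, chi, tau):
    """Answer an additive-error query (tolerance tau) using a relative-error oracle."""
    estimate = rel_oracle(chi, tau, tau)      # relative error tau, threshold tau
    return 0 if estimate is BOTTOM else estimate

def relative_from_additive(add_oracle, chi, mu, theta):
    """Answer a relative-error query (mu, theta) using an additive-error oracle."""
    estimate = add_oracle(chi, mu * theta / 3)
    return BOTTOM if estimate < theta * (1 - mu / 3) else estimate
```

In the second wrapper, an estimate that survives the cut-off certifies $P_\chi \ge \theta(1 - 2\mu/3) \ge \theta/3$, so an additive error of $\mu\theta/3$ is at most a relative error of $\mu$.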

Kearns [19] shows that almost all classes known to be PAC learnable are learnable with
additive error statistical queries. By the above theorem, these classes are also learnable with
relative error statistical queries. In addition, the hardness results of Kearns [19] for learning
parity functions and the general hardness results of Blum et al. [6] based on Fourier analysis
also hold for relative error statistical query learning.

4.3 A Natural Example of Relative Error SQ Learning 

In this section we examine a learning problem which has both a simple additive error SQ
algorithm and a simple relative error SQ algorithm. We consider the problem of learning a
monotone conjunction of Boolean variables in which the learning algorithm must determine
which subset of the variables $\{x_1, \ldots, x_n\}$ are contained in the unknown target conjunction $f$.

We construct an hypothesis $h$ which contains all the variables in the target function $f$, and
thus $h$ will not misclassify any negative examples. We further guarantee that for each variable
$x_i$ in $h$, the distribution weight of examples which satisfy "$x_i = 0$ and $f(x) = 1$" is at most $\epsilon/n$.
Therefore, the distribution weight of positive examples which $h$ will misclassify is at most $\epsilon$.
Such an hypothesis has error rate at most $\epsilon$.

Consider the following query: $\chi_i(x, l) = [(x_i = 0) \wedge (l = 1)]$. $P_{\chi_i}$ is simply the probability
that $x_i$ is false and $f(x)$ is true. If variable $x_i$ is in $f$, then $P_{\chi_i} = 0$. If we mistakenly include a
variable $x_i$ in our hypothesis which is not in $f$, then the error due to this inclusion is at most
$P_{\chi_i}$. We simply construct our hypothesis by including all target variables, but no variables $x_i$
for which $P_{\chi_i} > \epsilon/n$.
An additive error SQ algorithm queries each $\chi_i$ with additive error $\epsilon/2n$ and includes all
variables for which the estimate $\hat{P}_{\chi_i} \le \epsilon/2n$. Even if $P_{\chi_i} = 1/2$, the oracle is constrained to
return an estimate with additive error less than $\epsilon/2n$. A relative error SQ algorithm queries each
$\chi_i$ with relative error 1/2 and threshold $\epsilon/n$ and includes all variables for which the estimate
$\hat{P}_{\chi_i}$ is 0 or $\perp$.
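
The relative error algorithm for monotone conjunctions is short enough to write out. In the sketch below, `make_rel_stat` is a hypothetical stand-in for the Rel-STAT oracle (it computes $P_\chi$ exactly from a finite distribution and then answers as unhelpfully as the model allows), instances are tuples of bits indexed from 0, and `None` plays the role of $\perp$.

```python
from fractions import Fraction

BOTTOM = None  # stands for the symbol ⊥

def learn_monotone_conjunction(n, eps, rel_stat):
    """Keep variable i unless the oracle reports a nonzero estimate for chi_i."""
    hypothesis = []
    for i in range(n):
        # chi_i(x, l) = [(x_i = 0) and (l = 1)]
        chi_i = lambda x, l, i=i: int(x[i] == 0 and l == 1)
        estimate = rel_stat(chi_i, Fraction(1, 2), Fraction(eps) / n)
        if estimate is BOTTOM or estimate == 0:
            hypothesis.append(i)
    return hypothesis

def make_rel_stat(D, f):
    """A legal (hypothetical) oracle: ⊥ whenever permitted, else the lowest allowed value."""
    def rel_stat(chi, mu, theta):
        p = sum(w for x, w in D.items() if chi(x, f(x)))
        return BOTTOM if p < theta else p * (1 - mu)
    return rel_stat
```

Each query uses constant relative error 1/2; only the threshold $\epsilon/n$ depends on the accuracy parameter.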

The sample complexity of the standard, noise-free PAC simulation of additive error SQ
algorithms depends linearly on $1/\tau_*^2$ [19], while in Section 4.5, we show that the sample complexity
of a noise-free PAC simulation of relative error SQ algorithms depends linearly on $1/\mu_*^2\theta_*$. Note
that in the above algorithms for learning conjunctions, $1/\tau_*^2 = \Theta(n^2/\epsilon^2)$ while $1/\mu_*^2\theta_* = \Theta(n/\epsilon)$.
We further note that $\mu_*$ is constant for learning conjunctions. We show in Section 4.4 that no
learning problem requires $\mu_*$ to depend on $\epsilon$ and in Section 4.6 that $\mu_*$ is actually a constant
in many algorithms.
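
To make the gap concrete, consider the conjunction algorithms above with $n = 100$ and $\epsilon = 0.01$ (values chosen here purely for illustration). The additive error algorithm has $\tau_* = \epsilon/2n$, while the relative error algorithm has $\mu_* = 1/2$ and $\theta_* = \epsilon/n$, so

$$\frac{1}{\tau_*^2} = \frac{4n^2}{\epsilon^2} = \frac{4 \cdot 10^4}{10^{-4}} = 4 \times 10^8, \qquad \frac{1}{\mu_*^2 \theta_*} = \frac{4n}{\epsilon} = \frac{400}{10^{-2}} = 4 \times 10^4.$$

The two leading factors differ by a factor of $n/\epsilon = 10^4$.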

4.4 General Bounds on Learning in the Relative Error SQ Model 

In this section, we prove general upper bounds on the complexity of relative error statistical
query learning. We do so by applying boosting techniques [14, 15, 28] and specifically, these
techniques as applied in the statistical query model. We first prove some useful lemmas which
allow us to decompose relative error estimates of ratios and sums.

Lemma 4 If $0 \le a, b, c, \mu, \theta, \Phi \le 1$ and $a = b/c$, then to estimate $a$ with $(\mu, \theta)$ error provided
that $c \ge \Phi$, it is sufficient to estimate $c$ with $(\mu/3, \Phi)$ error and $b$ with $(\mu/3, \theta\Phi/2)$ error.

Proof: If the estimate $\hat{c}$ is $\perp$ or less than $\Phi(1-\mu/3)$, then $c < \Phi$. Therefore an estimate for $a$
is not required, and we may halt. Otherwise $\hat{c} \ge \Phi(1-\mu/3)$, and therefore
$c \ge \Phi \frac{1-\mu/3}{1+\mu/3} \ge \Phi/2$. If the estimate $\hat{b}$ is $\perp$, then $b < \theta\Phi/2$. Therefore $a = b/c < \theta$, so we
may answer $\hat{a} = \perp$. Otherwise, $\hat{b}$ and $\hat{c}$ are estimates of $b$ and $c$, each within a $1 \pm \mu/3$ factor.
The lemma follows by noting the following facts.

$$\begin{aligned}
\frac{b \cdot (1+\mu/3)}{c \cdot (1-\mu/3)} &= a \cdot \frac{1+\mu/3}{1-\mu/3} \\
&= a \cdot (1+\mu/3)\left(1 + \frac{\mu/3}{1-\mu/3}\right) \\
&\le a \cdot (1+\mu/3)(1+\mu/2) \\
&= a \cdot (1 + \mu/3 + \mu/2 + \mu^2/6) \\
&\le a \cdot (1+\mu)
\end{aligned}$$

$$\begin{aligned}
\frac{b \cdot (1-\mu/3)}{c \cdot (1+\mu/3)} &= a \cdot \frac{1-\mu/3}{1+\mu/3} \\
&\ge a \cdot (1-\mu/3)(1-\mu/3) \\
&= a \cdot (1 - 2\mu/3 + \mu^2/9) \\
&\ge a \cdot (1-\mu)
\end{aligned}$$

□



Lemma 5 If $0 \le s, p_i, z_i, \mu \le 1$, $\sum_i p_i \le 1$ and $s = \sum_i p_i z_i$, then to estimate $s$ with $(\mu, \theta)$
error, it is sufficient to estimate each $z_i$ with $(\mu/3, \mu\theta/3)$ error provided that the $p_i$ coefficients
are known.

Proof: Let $B = \{i : \text{estimate of } z_i \text{ is } \perp\}$, $E = \{i : \text{estimate of } z_i \text{ is } \hat{z}_i\}$, $s_B = \sum_B p_i z_i$ and
$s_E = \sum_E p_i z_i$. Note that $s_B \le \theta\mu/3$. Let $\hat{s}_E = \sum_E p_i \hat{z}_i$. If $\hat{s}_E < \theta(1-\mu/3)^2$ then we return $\perp$,
otherwise we return $\hat{s}_E$.

If $\hat{s}_E < \theta(1-\mu/3)^2$, then $s_E < \theta(1-\mu/3)$. But in this case
$s = s_E + s_B < \theta(1-\mu/3) + \theta\mu/3 = \theta$, so we are correct in returning $\perp$.

Otherwise we return $\hat{s}_E$ which is at least $\theta(1-\mu/3)^2$. If $B = \emptyset$, then it is easy to see that
$\hat{s}_E$ is within a $1 \pm \mu/3$ (and therefore $1 \pm \mu$) factor of $s$. Otherwise, we are implicitly setting
$\hat{z}_i = 0$ for each $i \in B$, and therefore it is enough to show that $\hat{s}_E \ge s(1-\mu)$.

Since $\hat{s}_E \ge \theta(1-\mu/3)^2$, we have $s_E \ge \theta(1-\mu/3)^2/(1+\mu/3)$. Using the fact that for all
$\mu \le 1$, $(1-\mu/3)/(1+\mu/3) \ge 1/2$, we have $s_E \ge \theta(1-\mu/3)/2$. If $\hat{s}_E \ge (\theta\mu/3 + s_E)(1-\mu)$, then
$\hat{s}_E \ge s(1-\mu)$ since $s_B \le \theta\mu/3$ and $s = s_B + s_E$. But since $\hat{s}_E \ge s_E(1-\mu/3)$, this condition
holds when $s_E(1-\mu/3) \ge (\theta\mu/3 + s_E)(1-\mu)$. Solving for $s_E$, this final condition holds when
$s_E \ge \theta(1-\mu)/2$ which we have shown to be true whenever an estimate is returned. □
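
The decision rule used in the proof of Lemma 5 can be written as a small helper. This is a sketch with a hypothetical function name; each entry of `z_hat` is either a number or `None` (standing for $\perp$), and exact rationals are used in the accompanying check to avoid rounding at the threshold.

```python
BOTTOM = None  # stands for the symbol ⊥

def estimate_sum(p, z_hat, mu, theta):
    """Combine per-term (mu/3, mu*theta/3) estimates into a (mu, theta) estimate of s.

    Terms reported as BOTTOM are implicitly treated as zero, as in the proof.
    """
    s_hat_E = sum(p_i * z for p_i, z in zip(p, z_hat) if z is not BOTTOM)
    return BOTTOM if s_hat_E < theta * (1 - mu / 3) ** 2 else s_hat_E
```

The cut-off $\theta(1-\mu/3)^2$ is what guarantees that returning $\perp$ is only done when $s < \theta$.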

Theorem 9 If the concept class $\mathcal{F}$ is strongly SQ learnable, then $\mathcal{F}$ is strongly SQ learnable by
an algorithm whose query complexity is $O(N_0 \log^2 \frac{1}{\epsilon})$, whose minimum requested relative error is
$\Omega(\mu_0)$ and whose minimum requested threshold is $\Omega(\mu_0 \theta_0 \epsilon / \log(1/\epsilon))$ where $N_0 = p_1(n, size(f))$,
$\mu_0 = 1/p_2(n, size(f))$ and $\theta_0 = 1/p_3(n, size(f))$ for some polynomials $p_1$, $p_2$ and $p_3$.

Proof: If $\mathcal{F}$ is strongly SQ learnable, then there exists a relative error statistical query
algorithm $A$ for learning $\mathcal{F}$. Hardwire the accuracy parameter of $A$ to 1/4 and apply Scheme 2
boosting. The boosting scheme will run $16 \ln(1/\epsilon)$ copies of $A$ with respect to $16 \ln(1/\epsilon)$ different
distributions over the instance space. Each run makes at most $N_0 = N_*(1/4, n, size(f))$ queries,
each with relative error no smaller than $\mu_0 = \mu_*(1/4, n, size(f))$ and threshold no smaller than
$\theta_0 = \theta_*(1/4, n, size(f))$. In run $i+1$, the algorithm makes queries to $STAT(f, D_{i+1})$ where $D_{i+1}$
is a distribution based on $D$. Since we only have access to a statistics oracle for $D$, queries to
$STAT(f, D_{i+1})$ are simulated by a sequence of new queries to $STAT(f, D)$ as follows:

$$STAT(f, D_{i+1})[\chi(x, f(x))] = \frac{\sum_{j=0}^{w} \alpha_j \cdot STAT(f, D)[\chi \wedge \chi_j]}{\sum_{j=0}^{w} \alpha_j \cdot STAT(f, D)[\chi_j]} \tag{4.1}$$

In the above equation $w \le i$, the values $\alpha_j \in [0,1]$ are known, and $\sum_j \alpha_j \le 1$. Also note
that if the denominator of Equation 4.1 is less than the abort threshold $\Phi$ of Scheme 2, which is
$\Omega(\epsilon / \ln(1/\epsilon))$, then the query need not be
estimated (this is the "abort" condition of Scheme 2). Applying Lemmas 4 and 5, we find that
the queries in the denominator can be estimated with $(\mu_0/9, \mu_0\Phi/9)$ error, and the queries in
the numerator can be estimated with $(\mu_0/9, \mu_0\theta_0\Phi/18)$ error. Since a query to $STAT(f, D_{i+1})$
requires $O(i)$ queries to $STAT(f, D)$, the total number of queries made is $O(N_0 \log^2(1/\epsilon))$. □

We finally note that the query space complexity obtained here is identical in form to the 
query space complexity obtained in Section 3.1.2. 
4.5 Simulating Relative Error SQ Algorithms in the PAC Model

In this section, we derive the complexity of simulating relative error SQ algorithms in the PAC
model, both in the absence and presence of noise. We also give general upper bounds on the
complexity of PAC algorithms derived from SQ algorithms based on the simulations and the
general bounds of the previous section. Note that there do not exist two-sided bounds for
uniform convergence based on VC-dimension, so some of our results are based on drawing a
separate sample for each query.

4.5.1 PAC Model Simulation 

The simulation of relative error SQ algorithms in the noise-free PAC model is based on a
Chernoff bound analysis. Let $GE(p, m, n)$ be the probability of at least $n$ successes in $m$
Bernoulli trials, where each trial has probability of success $p$. Similarly, let $LE(p, m, n)$ be the
probability of at most $n$ successes in $m$ Bernoulli trials, where each trial has probability of
success $p$. Chernoff's bounds may then be stated as follows [3]:

$$GE(p, m, mp(1+\alpha)) \le e^{-mp\alpha^2/3}$$
$$LE(p, m, mp(1-\alpha)) \le e^{-mp\alpha^2/2}$$

Furthermore, we often make use of the following properties of $GE$ and $LE$:

$$p \ge p' \Rightarrow LE(p, m, n) \le LE(p', m, n) \tag{4.2}$$

$$p \le p' \Rightarrow GE(p, m, n) \le GE(p', m, n) \tag{4.3}$$

We may now prove the following theorem.

Theorem 10 If $\mathcal{F}$ is learnable by a statistical query algorithm which makes at most $N_*$ queries
from query space $Q$ with worst case relative error $\mu_*$ and worst case threshold $\theta_*$, then $\mathcal{F}$ is
PAC learnable with sample complexity $O(\frac{1}{\mu_*^2 \theta_*} \log \frac{|Q|}{\delta})$ when $Q$ is finite or
$O(\frac{N_*}{\mu_*^2 \theta_*} \log \frac{N_*}{\delta})$ when drawing a separate sample for each query.
Proof: We first demonstrate how to estimate the value of a single query, and we then extend
this technique to yield the desired result. Let $[\chi, \mu, \theta]$ be a query to be estimated, and let
$p = P_\chi$. For a given sample of size $m$, let $\hat{p}$ be the fraction of examples which satisfy $\chi$. In order
to properly estimate the value of this query, we choose $m$ large enough to ensure that each of
the following hold with high probability:

1. If $\hat{p} < \theta/2$, then $p < \theta$.

2. If $\hat{p} \ge \theta/2$, then $p \ge \theta/4$.

3. If $p \ge \theta/4$, then $\hat{p} \ge (1-\mu)p$.

4. If $p \ge \theta/4$, then $\hat{p} \le (1+\mu)p$.

Thus, if $\hat{p} < \theta/2$, we may output $\perp$, and if $\hat{p} \ge \theta/2$, we may output $\hat{p}$. To ensure a failure
probability of at most $\delta$, we choose $m$ large enough to guarantee that each of the properties
fails to hold with probability at most $\delta/4$. Let $m = \frac{12}{\mu^2\theta} \ln \frac{4}{\delta}$.

Suppose that $p \ge \theta$. Then the probability that $\hat{p} < \theta/2$ is bounded by:

$$LE(p, m, m\theta/2) \le LE(\theta, m, m\theta/2) \le e^{-m\theta/8}$$

Since $m \ge \frac{8}{\theta} \ln \frac{4}{\delta}$, this probability is at most $\delta/4$. Therefore, the probability that $\hat{p} \ge \theta/2$ is at
least $1-\delta/4$. Thus, we have shown the following: With probability at least $1-\delta/4$,

$$p \ge \theta \Rightarrow \hat{p} \ge \theta/2.$$

Since Property 1 is the contrapositive of the above statement, we have shown that it will fail
to hold with probability at most $\delta/4$.

Property 2 is shown to hold in a similar manner, and Properties 3 and 4 are direct
consequences of Chernoff bounds.

Now, by choosing $m = \frac{12}{\mu_*^2\theta_*} \ln \frac{4|Q|}{\delta}$, we can ensure that all four properties will hold for all
$\chi \in Q$, with probability at least $1-\delta$. If, on the other hand, we draw $N_*$ separate samples
each of size $m = \frac{12}{\mu_*^2\theta_*} \ln \frac{4N_*}{\delta}$, we guarantee that all four properties will hold for each of the $N_*$
queries estimated, with probability at least $1-\delta$. □
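
The single-query estimator from this proof can be sketched as follows. The function names are hypothetical, `None` stands for $\perp$, and the constant 12 in the sample size is the one obtained from the Chernoff bounds stated above with $p \ge \theta/4$; treat it as illustrative rather than tuned.

```python
import math

BOTTOM = None  # stands for the symbol ⊥

def sample_size(mu, theta, delta):
    """m = (12 / (mu^2 * theta)) * ln(4 / delta), rounded up."""
    return math.ceil(12 / (mu ** 2 * theta) * math.log(4 / delta))

def estimate_query(chi, sample, theta):
    """Return ⊥ if the empirical frequency is below theta/2, else the frequency itself."""
    p_hat = sum(chi(x, l) for x, l in sample) / len(sample)
    return BOTTOM if p_hat < theta / 2 else p_hat
```

Note that the sample size scales with $1/\mu^2\theta$ rather than with the $1/\tau^2$ of the additive error simulation.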

Corollary 2 If $\mathcal{F}$ is SQ learnable, then $\mathcal{F}$ is PAC learnable with a sample complexity whose
dependence on $\epsilon$ is $\tilde{O}(1/\epsilon)$.

Although one could use boosting techniques in the PAC model to achieve this nearly optimal
sample complexity, these boosting techniques would result in a more complicated algorithm and
output hypothesis (a circuit whose inputs were hypotheses from the original hypothesis class).
If instead we have a relative error SQ algorithm meeting the bounds of Theorem 9, then we
achieve this PAC sample complexity directly.

4.5.2 Classification Noise Model Simulation 

For SQ simulations in the classification noise model, we achieve the sample complexity given
in Theorem 11 below. This sample complexity is obtained by simulating an additive error SQ
algorithm with $\tau = \mu\theta/3$ as in Theorem 8. Although this result does not improve the sample
complexity of SQ simulations in the presence of classification noise, we believe that to improve
upon this bound requires the use of relative error statistical queries for the reasons discussed
in Section 4.1.

Theorem 11 If $\mathcal{F}$ is learnable by a statistical query algorithm which makes queries from query
space $\mathcal{Q}$ with worst case relative error $\mu_*$ and worst case threshold $\theta_*$, then $\mathcal{F}$ is PAC learnable
in the presence of classification noise. If $\eta_b < 1/2$ is an upper bound on the noise rate, then the
sample complexity required is

Ug;(i-2, t )» lo S ¥ + HJ^ lo S lo S Trb) 
when Q is finite or 

Q ( VC(Q) ! 1 , 1 ! l\ 

\^ 2 Jl(l-^b) 2 S ^.9.(1-217,,) ^ ^ 2 J 2 (l-2 Vh f iU & Sj 

when Q has finite VC-dimension. 

Corollary 3 If $\mathcal{F}$ is SQ learnable, then $\mathcal{F}$ is PAC learnable in the presence of classification
noise. The dependence on $\epsilon$ and $\eta_b$ of the required sample complexity is $\tilde{O}\left(\frac{1}{\epsilon^2 (1 - 2\eta_b)^2}\right)$.

4.5.3 Malicious Error Model Simulation

We next consider the simulation of relative error SQ algorithms in the presence of malicious
errors. Decatur [9] has shown that an SQ algorithm can be simulated in the presence of malicious
errors with a maximum allowable error rate which depends on $\tau_*$, the smallest additive error
required by the SQ algorithm. In Theorem 12, we show that an SQ algorithm can be simulated
in the presence of malicious errors with a maximum allowable error rate and sample complexity
which depend on $\mu_*$ and $\theta_*$, the minimum relative error and threshold required by the SQ
algorithm.

The key idea in this simulation is to draw a large enough sample such that for each query,
the combined error in an estimate due to both the adversary and the statistical fluctuation on
error-free examples is less than the accuracy required. We formally state this idea in the claim
given below.

Claim 4 Let $\hat{P}^*_\chi$ be the fraction of examples satisfying $\chi$ in a noise-free sample of size $m$, and
let $\hat{P}_\chi$ be the fraction of examples satisfying $\chi$ in a sample of size $m$ drawn from $EX^{\beta}_{MAL}(f, D)$.
Then to ensure $|\hat{P}_\chi - P_\chi| \leq \tau_1 + \tau_2$, it is sufficient to draw a sample of size $m$ which simultaneously
ensures that:

(1) The adversary corrupts at most a $\tau_1$ fraction of the examples drawn from $EX^{\beta}_{MAL}(f, D)$.

(2) $|\hat{P}^*_\chi - P_\chi| \leq \tau_2$.

Theorem 12 If $\mathcal{F}$ is learnable by a statistical query algorithm which makes at most $N_*$ queries
from query space $\mathcal{Q}$ with worst case relative error $\mu_*$ and worst case threshold $\theta_*$, then $\mathcal{F}$ is
PAC learnable in the presence of malicious errors. The maximum allowable error rate is $\beta_* =
\Omega(\mu_* \theta_*)$, and the sample complexity required is $O\left(\frac{1}{\mu_*^2 \theta_*} \log \frac{|\mathcal{Q}|}{\delta}\right)$ when $\mathcal{Q}$ is finite or
$O\left(\frac{N_*}{\mu_*^2 \theta_*} \log \frac{N_*}{\delta}\right)$ when drawing a separate sample for each query.

Proof: We first analyze the tolerable error and sample complexity for simulating a single query
and then determine these values for simulating the entire algorithm.

For a given query $[\chi, \mu, \theta]$, $P_\chi$ is the probability with respect to the noise-free example oracle
which needs to be estimated with $(\mu, \theta)$ error. Assume that $\beta \leq \mu\theta/16$ and let $\hat{\beta}$ be the actual
fraction of the sample corrupted by the malicious adversary. We choose $m$ large enough to
ensure that the following hold with high probability:

1. If $\beta \leq \mu\theta/16$, then $\hat{\beta} \leq \mu\theta/8$.

2. If $\hat{P}^*_\chi < 5\theta/8$, then $P_\chi < \theta$.

3. If $\hat{P}^*_\chi \geq 3\theta/8$, then $P_\chi > \theta/4$.

4. If $P_\chi > \theta/4$, then $\hat{P}^*_\chi \geq (1 - \mu/2) P_\chi$.

5. If $P_\chi > \theta/4$, then $\hat{P}^*_\chi \leq (1 + \mu/2) P_\chi$.

Suppose that Properties 1 through 5 all hold. If $\hat{P}_\chi < \theta/2$, then by Property 1, $\hat{P}^*_\chi < 5\theta/8$, and
by Property 2, $P_\chi < \theta$. Thus, we may return $\bot$.

If, on the other hand, $\hat{P}_\chi \geq \theta/2$, then by Property 1, $\hat{P}^*_\chi \geq 3\theta/8$, and by Property 3,
$P_\chi > \theta/4$. Property 4 then implies that $\hat{P}^*_\chi \geq (1 - \mu/2) P_\chi$, and by Property 1, we have the
following:

\[ \hat{P}_\chi \geq (1 - \mu/2) P_\chi - \mu\theta/8 \geq (1 - \mu/2) P_\chi - \mu P_\chi/2 = (1 - \mu) P_\chi \]

By applying Property 5, we may similarly show the following:

\[ \hat{P}_\chi \leq (1 + \mu/2) P_\chi + \mu\theta/8 \leq (1 + \mu/2) P_\chi + \mu P_\chi/2 = (1 + \mu) P_\chi \]

Thus, we may return $\hat{P}_\chi$.
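
The two chains of inequalities above can be checked numerically. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the worst case permitted by Properties 1, 4 and 5, namely that the noise-free sample fraction deviates from the true probability by a factor of at most $1 \pm \mu/2$ and that the adversary shifts it by at most $\mu\theta/8$.

```python
from typing import Tuple


def corrupted_range(p: float, mu: float, theta: float) -> Tuple[float, float]:
    """Worst-case range of the corrupted estimate when the true probability is p."""
    lo = (1 - mu / 2) * p - mu * theta / 8
    hi = (1 + mu / 2) * p + mu * theta / 8
    return lo, hi


def within_relative_error(p: float, mu: float, theta: float) -> bool:
    """True if every value in the worst-case range lies in [(1-mu)p, (1+mu)p]."""
    lo, hi = corrupted_range(p, mu, theta)
    slack = 1e-12  # guards against floating-point rounding at the boundary
    return (1 - mu) * p <= lo + slack and hi <= (1 + mu) * p + slack
```

The check succeeds whenever the true probability is at least $\theta/4$ and fails below that point, which is why the threshold properties are needed before the relative error guarantee can be applied.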

We can ensure that Properties 1 through 5 collectively hold with probability at least $1 - \delta$
by letting $m = O\left(\frac{1}{\mu^2 \theta} \ln \frac{1}{\delta}\right)$. The proofs that each of these properties holds with high probability
given this sample size are analogous to the proofs for the similar properties used in Theorem 10.

Now, by choosing $m = O\left(\frac{1}{\mu_*^2 \theta_*} \ln \frac{|\mathcal{Q}|}{\delta}\right)$, we can ensure that all five properties will hold for all
$\chi \in \mathcal{Q}$, with probability at least $1 - \delta$. If, on the other hand, we draw $N_*$ separate samples
each of size $m = O\left(\frac{1}{\mu_*^2 \theta_*} \ln \frac{N_*}{\delta}\right)$, we guarantee that all five properties will hold for each of the $N_*$
queries estimated, with probability at least $1 - \delta$. □

Corollary 4 If $\mathcal{F}$ is SQ learnable, then $\mathcal{F}$ is PAC learnable in the presence of malicious errors.
The dependence on $\epsilon$ of the maximum allowable error rate is $\tilde{\Omega}(\epsilon)$, while the dependence on $\epsilon$
of the required sample complexity is $\tilde{O}(1/\epsilon)$.

Note that we are within logarithmic factors of both the $O(\epsilon)$ maximum allowable malicious
error rate [20] and the $\Omega(1/\epsilon)$ lower bound on the sample complexity of noise-free PAC
learning [11]. In this malicious error tolerant PAC simulation, the sample, time, space and
hypothesis size complexities are asymptotically identical to the corresponding complexities in
our noise-free PAC simulation.

4.6 Very Efficient Learning in the Presence of Malicious Errors 

In previous sections, we have shown general upper bounds on the required complexity of relative
error SQ algorithms and the efficiency of PAC algorithms derived from them. In this section,
we describe relative error SQ algorithms which actually achieve these bounds and therefore
have very efficient, malicious error tolerant PAC simulations. We first present a very efficient
algorithm for learning conjunctions$^1$ in the presence of malicious errors when there are many
irrelevant attributes. We then highlight a property of this SQ algorithm which allows for its
efficiency, and we further show that many other SQ algorithms naturally exhibit this property
as well. We can simulate these SQ algorithms in the malicious error model with roughly optimal
malicious error tolerance and sample complexity.

Decatur [9] gives an algorithm for learning conjunctions which tolerates a malicious error
rate independent of the number of irrelevant attributes, thus depending only on the number
of relevant attributes and the desired accuracy. This algorithm, while reasonably efficient, is
based on an additive error SQ algorithm of Kearns [19] and therefore does not have an optimal
sample complexity.

We present an algorithm based on relative error statistical queries which tolerates the same
malicious error rate and has a sample complexity whose dependence on $\epsilon$ roughly matches the
general lower bound for noise-free PAC learning.

Theorem 13 The class of conjunctions of size $k$ over $n$ variables is PAC learnable with
malicious errors. The maximum allowable malicious error rate is $\Omega\left(\frac{\epsilon}{k \log(1/\epsilon)}\right)$, and the sample
complexity required is $O\left(\frac{k^2}{\epsilon} \log^2 \frac{1}{\epsilon} \log n + \frac{k}{\epsilon} \log \frac{1}{\epsilon} \log \frac{1}{\delta}\right)$.

Proof: We present a proof for learning monotone conjunctions of size $k$, and we note that this
proof can easily be extended for learning non-monotone conjunctions of size $k$.

$^1$ By duality, identical results also hold for learning disjunctions.

The target function $f$ is a conjunction of $k$ variables. We construct an hypothesis $h$ which
is a conjunction of $r = O(k \log \frac{1}{\epsilon})$ variables such that the distribution weight of misclassified
positive examples is at most $\epsilon/2$ and the distribution weight of misclassified negative examples
is also at most $\epsilon/2$.

First, all variables which could contribute more than $\epsilon/2r$ error on the positive examples
are eliminated from consideration. This is accomplished by using the same queries that the
monotone conjunction SQ algorithm of Section 4.3 uses. The queries are asked with relative
error 1/2 and threshold $\epsilon/2r$.

Next, the negative examples are greedily "covered" so that the distribution weight of
misclassified negative examples is no more than $\epsilon/2$. We say that a variable covers all negative
examples for which this variable is false. We know that the set of variables in $f$ is a cover of
size $k$ for the entire space of negative examples. We iteratively construct $h$ by conjoining new
variables such that the distribution weight of negative examples covered by each new variable
is at least a $\frac{1}{2k}$ fraction of the distribution weight of negative examples remaining to be covered.

Given a partially constructed hypothesis $h_j = x_{i_1} \wedge x_{i_2} \wedge \cdots \wedge x_{i_j}$, let $X_j^-$ be the set of
negative examples not covered by $h_j$, i.e. $X_j^- = \{x : (f(x) = 0) \wedge (h_j(x) = 1)\}$. Let $D_j^-$ be the
conditional distribution on $X_j^-$ induced by $D$, i.e. for any $x \in X_j^-$, $D_j^-(x) = D(x)/D(X_j^-)$. By
definition, $X_0^-$ is the space of negative examples and $D_0^-$ is the conditional distribution on $X_0^-$.
We know that the target variables not yet in $h_j$ cover the remaining examples in $X_j^-$; hence,
there exists a cover of $X_j^-$ of size at most $k$. Thus there exists at least one variable which covers
a set of negative examples in $X_j^-$ whose distribution weight with respect to $D_j^-$ is at least $1/k$.

Given $h_j$, for each $x_i$, let $\chi_{j,i}(x, l) = [x_i = 0 \mid (l = 0) \wedge (h_j(x) = 1)]$. Note that
$P_{\chi_{j,i}}$ is the distribution weight, with respect to $D_j^-$, of negative examples in $X_j^-$ covered by
$x_i$. Thus there exists a variable $x_i$ such that $P_{\chi_{j,i}}$ is at least $1/k$. To find such a variable, we
ask queries of the above form with relative error 1/3 and threshold $2/3k$. [Note that this is a
query for a conditional probability, which must be determined by the ratio of two unconditional
probabilities. We show how to do this below.] Since there exists a variable $x_i$ such that
$P_{\chi_{j,i}} \geq 1/k$, we are guaranteed to find some variable $x_{i'}$ such that the estimate $\hat{P}_{\chi_{j,i'}}$ is at least
$\frac{1}{k}(1 - \frac{1}{3}) = \frac{2}{3k}$. Note that if $\hat{P}_{\chi_{j,i'}} \geq \frac{2}{3k}$, then $P_{\chi_{j,i'}} \geq \frac{2}{3k}/(1 + \frac{1}{3}) = \frac{1}{2k}$. Thus, by conjoining $x_{i'}$ to
$h_j$, we are guaranteed to cover a set of negative examples in $X_j^-$ whose distribution weight with
respect to $D_j^-$ is at least $1/2k$. Since the distribution weight, with respect to $D_0^-$, of uncovered
negative examples is reduced by at least a $(1 - \frac{1}{2k})$ factor in each iteration, it is easy to show that
this method requires no more than $r = O(k \log \frac{1}{\epsilon})$ iterations to cover all but a set of negative
examples whose distribution weight, with respect to $D_0^-$ (and therefore with respect to $D$), is at
most $\epsilon/2$.
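
The greedy covering step can be sketched in Python. This is an illustration under simplifying assumptions that are not part of the proof: the weights of the negative examples are given explicitly, so the conditional weights are computed exactly rather than estimated by statistical queries, and the sketch always takes the heaviest remaining variable rather than any variable whose estimate clears the threshold. The names `greedy_cover`, `neg_weights` and `candidates` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def greedy_cover(neg_weights: Dict[Tuple[int, ...], float],
                 candidates: List[int],
                 k: int,
                 eps: float) -> List[int]:
    """Conjoin variables until the uncovered negative weight is at most eps / 2.

    neg_weights maps each negative example (a 0/1 tuple) to its weight under D.
    Variable i covers an example x when x[i] == 0.
    """
    hypothesis: List[int] = []
    uncovered = dict(neg_weights)
    while sum(uncovered.values()) > eps / 2:
        total = sum(uncovered.values())
        best: Optional[int] = None
        best_weight = 0.0
        for i in candidates:
            if i in hypothesis:
                continue
            # Conditional weight of the uncovered negatives that variable i covers.
            w = sum(wt for x, wt in uncovered.items() if x[i] == 0) / total
            if w > best_weight:
                best, best_weight = i, w
        if best is None or best_weight < 1.0 / (2 * k):
            raise ValueError("no variable covers a 1/(2k) fraction")
        hypothesis.append(best)
        # Examples with x[best] == 1 are still accepted by the hypothesis.
        uncovered = {x: wt for x, wt in uncovered.items() if x[best] == 1}
    return hypothesis
```

With a target conjunction over the first two of three variables, the sketch recovers exactly those two variables, because each covers at least half of the remaining negative weight at the step where it is chosen.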

We now show how to estimate the conditional probability query $[A|B]$ with relative error
$\mu = 1/3$ and threshold $\theta = 2/3k$. We estimate both queries which constitute the standard
expansion of the conditional probability. Appealing to Lemma 4, we first estimate $[B]$, the
probability that a negative example is not covered by $h$, using relative error $\mu/3 = 1/9$ and
threshold $\epsilon/2$. If this estimate is $\bot$ or less than $\frac{\epsilon}{2}(1 - \frac{1}{9}) = \frac{4\epsilon}{9}$, then the weight of negative
examples misclassified by $h$ is at most $\epsilon/2$, so we may halt and output $h$. Otherwise, we
estimate $[A \wedge B]$ with relative error $\mu/3 = 1/9$ and threshold $\theta(\epsilon/2)/2 = \frac{\epsilon}{6k}$. If this estimate
is $\bot$, then we may return $\bot$, and if a value is returned, then we can return the ratio of our
estimates for $[A \wedge B]$ and $[B]$ as an estimate for $[A|B]$.
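
The choice of relative error $\mu/3$ for the two constituent queries can be checked with exact arithmetic. The sketch below (the function name `ratio_error_bounds` is hypothetical) computes the worst-case multiplicative error of a ratio of two estimates, each accurate to within relative error $\mu/3$, so that it can be compared against $1 \pm \mu$.

```python
from fractions import Fraction
from typing import Tuple


def ratio_error_bounds(mu: Fraction) -> Tuple[Fraction, Fraction]:
    """Smallest and largest factor by which a ratio of two estimates can be off."""
    g = mu / 3
    low = (1 - g) / (1 + g)   # numerator underestimated, denominator overestimated
    high = (1 + g) / (1 - g)  # numerator overestimated, denominator underestimated
    return low, high
```

For $\mu = 1/3$ the two factors are 4/5 and 5/4, which lie inside the required interval from 2/3 to 4/3.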

For this algorithm, the worst case relative error is $\Theta(1)$, the worst case threshold is $\Omega\left(\frac{\epsilon}{k \log(1/\epsilon)}\right)$,
and $\log |\mathcal{Q}| = O(k \log \frac{1}{\epsilon} \log n)$. Therefore, the theorem follows from Theorem 12. □

An important property of this statistical query algorithm is that for every query, we need
only to determine whether $P_\chi$ falls below some threshold or above some constant fraction of this
threshold. This allows the relative error parameter $\mu$ to be a constant. The learning algorithm
described in Section 4.3 for monotone conjunctions has this property, and we note that many
other learning algorithms which involve "covering" also have this property (e.g. the standard
SQ algorithms for learning decision lists and axis parallel rectangles). In all these cases we
obtain very efficient, malicious error tolerant algorithms.

Extensions



Throughout this thesis, we have assumed that queries submitted to the statistical query oracle
were restricted to being $\{0,1\}$-valued functions of labelled examples. In this case, the oracle
returned an estimate of the probability that $\chi(x, f(x)) = 1$ on an example $x$ chosen randomly
according to $D$.

We now generalize the SQ model to allow algorithms to submit queries which are real-valued.
Formally, we define a real-valued query to be a mapping from labelled examples to the
real interval $[0, M]$, $\chi : X \times \{0,1\} \rightarrow [0, M]$.$^1$ We define $P_\chi$ to be the expected value of $\chi$,
$P_\chi = E_D[\chi(x, f(x))] = E_{D_f}[\chi]$.

This generalization can be quite useful. If the learning algorithm requires the expected value
of some function of labelled examples, it may simply specify this using a real-valued query. By
suitably constructing new queries, the learning algorithm may calculate variance and other
moments as well. This generalization gives the algorithm designer more freedom and power.
Furthermore, the ability to efficiently simulate these algorithms in the PAC model, in both the
absence and presence of noise, is retained as shown below.

The results given below are proven almost identically to their counterparts by simply
applying Hoeffding and Chernoff style bounds for bounded real random variables. The following
is a simple extension of results contained in McDiarmid [23]:



$^1$ The range $[0, M]$ is used so that we can derive efficient simulations of relative error SQ algorithms. For
additive error SQ algorithms, one may consider any interval $[a, b]$ where $M = b - a$.




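Theorem 14 Let $X_1, X_2, \ldots, X_m$ be independent and identically distributed random variables
where $0 \leq X_i \leq M$ and $p = E[X_i]$, and let $\hat{p} = \frac{1}{m} \sum_{i=1}^{m} X_i$. For any $\alpha > 0$,

\[ \Pr[\hat{p} \geq p + \alpha] \leq e^{-2m\alpha^2/M^2} \]
\[ \Pr[\hat{p} \leq p - \alpha] \leq e^{-2m\alpha^2/M^2} \]

For any $\gamma$, $0 \leq \gamma \leq 1$,

\[ \Pr[\hat{p} \geq p(1 + \gamma)] \leq e^{-mp\gamma^2/3M} \]
\[ \Pr[\hat{p} \leq p(1 - \gamma)] \leq e^{-mp\gamma^2/2M} \]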
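
As a rough illustration of how Theorem 14 translates into sample sizes, the sketch below inverts the two upper-tail bounds for a single $[0, M]$-valued query. The function names, and the choice to allot the whole failure probability `delta` to one tail, are assumptions made for the example.

```python
import math


def additive_sample_size(alpha: float, M: float, delta: float) -> int:
    """Smallest m with exp(-2 m alpha^2 / M^2) <= delta."""
    return math.ceil(M * M * math.log(1 / delta) / (2 * alpha * alpha))


def relative_sample_size(gamma: float, p: float, M: float, delta: float) -> int:
    """Smallest m with exp(-m p gamma^2 / (3 M)) <= delta."""
    return math.ceil(3 * M * math.log(1 / delta) / (p * gamma * gamma))
```

The additive bound scales with $M^2$ while the relative bound scales with $M$, which matches the dependence on $M$ in the sample complexities of the theorems that follow.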

Note that when $M = 1$, the following sample complexities and noise tolerances are essentially
identical to those for $\{0,1\}$-valued queries.

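Theorem 15 If $\mathcal{F}$ is learnable by a statistical query algorithm which makes at most $N_*$ $[0, M]$-valued
queries from query space $\mathcal{Q}$ with worst case additive error $\tau_*$, then $\mathcal{F}$ is PAC learnable
with sample complexity $O\left(\frac{M^2}{\tau_*^2} \log \frac{|\mathcal{Q}|}{\delta}\right)$ when $\mathcal{Q}$ is finite or $O\left(\frac{N_* M^2}{\tau_*^2} \log \frac{N_*}{\delta}\right)$ when drawing a
separate sample for each query.

Theorem 16 If $\mathcal{F}$ is learnable by a statistical query algorithm which makes at most $N_*$ $[0, M]$-valued
queries from query space $\mathcal{Q}$ with worst case relative error $\mu_*$ and worst case threshold
$\theta_*$, then $\mathcal{F}$ is PAC learnable with sample complexity $O\left(\frac{M}{\mu_*^2 \theta_*} \log \frac{|\mathcal{Q}|}{\delta}\right)$ when $\mathcal{Q}$ is finite or
$O\left(\frac{N_* M}{\mu_*^2 \theta_*} \log \frac{N_*}{\delta}\right)$ when drawing a separate sample for each query.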

Theorem 17 If $\mathcal{F}$ is learnable by a statistical query algorithm which makes at most $N_*$ $[0, M]$-valued
queries from query space $\mathcal{Q}$ with worst case additive error $\tau_*$, then $\mathcal{F}$ is PAC learnable
in the presence of classification noise. If $\eta_b < 1/2$ is an upper bound on the noise rate, then the
sample complexity required is

O ( 7I ^ log If + _L_ log log ^ 

when Q is finite or 

o (^g^iog^ + ^a^iogiog^ 

when drawing a separate sample for each query.

Theorem 18 If $\mathcal{F}$ is learnable by a statistical query algorithm which makes at most $N_*$ $[0, M]$-valued
queries from query space $\mathcal{Q}$ with worst case relative error $\mu_*$ and worst case threshold $\theta_*$,
then $\mathcal{F}$ is PAC learnable in the presence of classification noise. If $\eta_b < 1/2$ is an upper bound
on the noise rate, then the sample complexity required is

O ( Ml i oe M + I io ff be 1 

when Q is finite or 

V/jJ9J(l-2iJb) 2 S ~~t + e (l_2i; b )2 ^°§ ^°§ l-2i; b 

when drawing a separate sample for each query.

Theorem 19 If $\mathcal{F}$ is learnable by a statistical query algorithm which makes at most $N_*$ $[0, M]$-valued
queries from query space $\mathcal{Q}$ with worst case additive error $\tau_*$, then $\mathcal{F}$ is PAC learnable
in the presence of malicious errors. The maximum allowable error rate is $\Omega(\tau_*/M)$ and the
sample complexity required is $O\left(\frac{M^2}{\tau_*^2} \log \frac{|\mathcal{Q}|}{\delta}\right)$ when $\mathcal{Q}$ is finite or $O\left(\frac{N_* M^2}{\tau_*^2} \log \frac{N_*}{\delta}\right)$ when drawing
a separate sample for each query.

Theorem 20 If $\mathcal{F}$ is learnable by a statistical query algorithm which makes at most $N_*$ $[0, M]$-valued
queries from query space $\mathcal{Q}$ with worst case relative error $\mu_*$ and worst case threshold
$\theta_*$, then $\mathcal{F}$ is PAC learnable in the presence of malicious errors. The maximum allowable error
rate is $\Omega(\mu_* \theta_*/M)$ and the sample complexity required is $O\left(\frac{M}{\mu_*^2 \theta_*} \log \frac{|\mathcal{Q}|}{\delta}\right)$ when $\mathcal{Q}$ is finite or
$O\left(\frac{N_* M}{\mu_*^2 \theta_*} \log \frac{N_*}{\delta}\right)$ when drawing a separate sample for each query.

We have examined the statistical query model of learning and derived the first general bounds
on the complexity of learning in this model. We have further shown that our general bounds are
nearly optimal in many respects by demonstrating a specific class of functions whose minimum
learning complexity nearly matches our general bounds. We have also improved the current
strategy for simulating SQ algorithms in the classification noise model by demonstrating a new
simulation which is both more efficient and more easily generalized.

The standard statistical query model of learning has a number of demonstrable deficiencies,
and we have proposed a variant of the statistical query model based on relative error in order to
combat these deficiencies. We have demonstrated the equivalence of additive error and relative
error SQ learnability, and we have derived general bounds on the complexity of learning in this
new relative error SQ model. We have demonstrated strategies for simulating relative error SQ
algorithms in the PAC model, both in the absence and presence of noise. Our simulations in
the absence of noise and in the presence of malicious errors yield nearly optimal noise-tolerant
PAC learning algorithms.

Finally, we have shown that our results in both the additive and relative error SQ models
can be extended to allow for real-valued queries.

The question of what sample complexity is required to simulate statistical query algorithms
in the presence of classification noise remains open. The current simulations of both additive
and relative error SQ algorithms yield PAC algorithms whose sample complexities depend
quadratically on $1/\epsilon$. However, in the absence of computational restrictions, all finite concept
classes can be learned in the presence of classification noise using a sample complexity which
depends linearly on $1/\epsilon$ [21]. It seems highly unlikely that an $O(1/\epsilon)$ strategy for simulating
additive error SQ algorithms exists; however, such a strategy for simulating relative error SQ
algorithms seems plausible. This line of research is currently being pursued.

As discussed in Section 4.6, many classes which are SQ learnable have algorithms with a
constant worst case relative error $\mu_*$. Can one show that all classes which are SQ learnable
have algorithms with this property, or instead characterize exactly which classes do?

Appendix

In this chapter, we prove a number of technical results used in the previous chapters.

A.1 The Finite Query Space Complexity of Boosting

In this section of the appendix, we show how to simplify the expression for the size of the query
space of boosting and how to derive an expression for the size of the query space of hybrid
boosting. These results apply when the query space and hypothesis class are finite.

A.1.1 The Size of the Query Space of Boosting

In Section 3.1, the following expression was obtained for the size of the query space of boosting:

\[ |\mathcal{Q}_B| = |\mathcal{Q}_0| + \sum_{i=1}^{k} (i+1) \binom{|\mathcal{H}_0| + i - 1}{i} + |\mathcal{Q}_0| \sum_{i=1}^{k} (i+1) \binom{|\mathcal{H}_0| + i - 1}{i} \qquad \text{(A.1)} \]

where $k$ is $k_1$ or $k_2$ depending on the type of boosting used.

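We begin by simplifying the expression $\sum_{i=1}^{k} (i+1) \binom{N+i-1}{i}$. In order to obtain a closed-form
expression for this sum, we first eliminate the $(i+1)$ factor.

\[
\begin{aligned}
(i+1) \binom{N+i-1}{i} &= i \cdot \frac{(N+i-1)!}{(N-1)!\, i!} + \binom{N+i-1}{i} \\
&= N \cdot \frac{(N+i-1)!}{N!\, (i-1)!} + \binom{N+i-1}{i} \\
&= N \binom{N+i-1}{i-1} + \binom{N+i-1}{i}
\end{aligned}
\]

Using the fact that $\sum_{i=0}^{m} \binom{n+i}{i} = \binom{n+m+1}{m}$, we now have the following:

\[ \sum_{i=1}^{k} (i+1) \binom{N+i-1}{i} = N \binom{N+k}{k-1} + \binom{N+k}{k} - 1 \]

Applying this fact to Equation A.1 above with $N = |\mathcal{H}_0|$, we obtain the following closed-form expression:

\[ |\mathcal{Q}_B| = (|\mathcal{Q}_0| + 1) \binom{|\mathcal{H}_0| + k}{k} + |\mathcal{H}_0| (|\mathcal{Q}_0| + 1) \binom{|\mathcal{H}_0| + k}{k-1} - 1 \qquad \text{(A.2)} \]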

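In order to bound the above expression, we make use of the following inequality:

\[
\begin{aligned}
\binom{n+m}{m} &= \frac{(n+m)(n+m-1) \cdots (n+1)}{m(m-1) \cdots 1} \\
&= \left(1 + \frac{n}{m}\right) \left(1 + \frac{n}{m-1}\right) \cdots \left(1 + \frac{n}{1}\right) \\
&\leq (1 + n)^m
\end{aligned}
\]

Applying this inequality, we now have:

\[
\begin{aligned}
|\mathcal{Q}_B| &\leq (|\mathcal{Q}_0| + 1)(|\mathcal{H}_0| + 1)^k + |\mathcal{H}_0| (|\mathcal{Q}_0| + 1)(|\mathcal{H}_0| + 2)^{k-1} - 1 \\
&\leq (|\mathcal{Q}_0| + 1)(|\mathcal{H}_0| + 2)^k + (|\mathcal{Q}_0| + 1)(|\mathcal{H}_0| + 2)^k \\
&= 2 (|\mathcal{Q}_0| + 1)(|\mathcal{H}_0| + 2)^k \qquad \text{(A.3)}
\end{aligned}
\]

The complexity of simulating an SQ algorithm depends on $\log |\mathcal{Q}_B|$. We have effectively shown
the following:

\[ \log |\mathcal{Q}_B| = O(\log |\mathcal{Q}_0| + k \log |\mathcal{H}_0|) \]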
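
The simplification above can be sanity-checked numerically. The following sketch evaluates the sum in Equation A.1, the closed form in Equation A.2 and the bound in Equation A.3 for small parameter values; the function names are hypothetical, and the formulas encode one reading of the three equations, which are stated here as assumptions.

```python
from math import comb


def size_by_sum(q0: int, h0: int, k: int) -> int:
    """Equation A.1: |Q_0| + S + |Q_0| * S with S the weighted binomial sum."""
    s = sum((i + 1) * comb(h0 + i - 1, i) for i in range(1, k + 1))
    return q0 + s + q0 * s


def size_closed_form(q0: int, h0: int, k: int) -> int:
    """Equation A.2: closed form of the same quantity."""
    return (q0 + 1) * comb(h0 + k, k) + h0 * (q0 + 1) * comb(h0 + k, k - 1) - 1


def size_upper_bound(q0: int, h0: int, k: int) -> int:
    """Equation A.3: 2 (|Q_0| + 1) (|H_0| + 2)^k."""
    return 2 * (q0 + 1) * (h0 + 2) ** k
```

For example, with 3 queries, 2 hypotheses and $k = 2$, both the sum and the closed form evaluate to 55, while the bound evaluates to 128.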

A.1.2 The Size of the Query Space of Hybrid Boosting

In the hybrid boosting scheme, the Scheme 1 and Scheme 2 boosting schemes are combined to
obtain improved overall complexities. The Scheme 2 booster uses the Scheme 1 booster, run
with $\epsilon = 1/4$, as its "weak learner," while the Scheme 1 booster uses the actual weak learner.
Thus, $k_1 = k_1(\gamma, 1/4)$ and $k_2 = k_2(1/4, \epsilon)$. Let $\mathcal{Q}_{HB}$ be the query space of the hybrid booster,
and let $\mathcal{Q}_{1/4}$ and $\mathcal{H}_{1/4}$ be the query space and hypothesis class of the Scheme 1 booster. By the
results of the previous section, we have the following:

\[ |\mathcal{Q}_{HB}| \leq 2 (|\mathcal{Q}_{1/4}| + 1)(|\mathcal{H}_{1/4}| + 2)^{k_2} \]
\[ |\mathcal{Q}_{1/4}| \leq 2 (|\mathcal{Q}_0| + 1)(|\mathcal{H}_0| + 2)^{k_1} \]

The hypotheses in $\mathcal{H}_{1/4}$ are majority functions of up to $k_1$ hypotheses from $\mathcal{H}_0$. The number
of unique majority functions of $i$ hypotheses from $\mathcal{H}_0$ is given by $\binom{|\mathcal{H}_0| + i - 1}{i}$, and therefore the
number of unique majority functions of up to $k_1$ hypotheses from $\mathcal{H}_0$ is given by $\sum_{i=1}^{k_1} \binom{|\mathcal{H}_0| + i - 1}{i}$.
Using the techniques of the previous section, we can simplify this expression as follows:

\[
\begin{aligned}
|\mathcal{H}_{1/4}| &= \sum_{i=1}^{k_1} \binom{|\mathcal{H}_0| + i - 1}{i} \\
&= \binom{|\mathcal{H}_0| + k_1}{k_1} - 1 \\
&\leq (|\mathcal{H}_0| + 1)^{k_1} - 1
\end{aligned}
\]

Combining these results, we obtain the following:

\[ |\mathcal{Q}_{HB}| \leq 2 \left( 2 (|\mathcal{Q}_0| + 1)(|\mathcal{H}_0| + 2)^{k_1} + 1 \right) \left( (|\mathcal{H}_0| + 1)^{k_1} + 1 \right)^{k_2} \]

The complexity of simulating an SQ algorithm depends on $\log |\mathcal{Q}_{HB}|$. We have effectively shown
the following:

\[ \log |\mathcal{Q}_{HB}| = O(\log |\mathcal{Q}_0| + k_1 k_2 \log |\mathcal{H}_0|) \]

A.2 Proofs Involving VC-Dimension

In this section of the appendix, we prove a number of technical lemmas which involve the concept
of Vapnik-Chervonenkis dimension [36]. We begin by defining VC-dimension and introducing
a number of preliminary results.

A.2.1 Preliminaries

Let $\mathcal{G}$ be a set of $\{0,1\}$-valued functions defined over a domain $X$. For any countable set $S =
\{x_1, \ldots, x_m\} \subseteq X$ and function $g \in \mathcal{G}$, $g$ defines a labelling of $S$ as follows: $\langle g(x_1), \ldots, g(x_m) \rangle$.
$S$ is said to be shattered by $\mathcal{G}$ if $S$ can be labelled in all possible $2^m$ ways by functions in $\mathcal{G}$.
The VC-dimension of $\mathcal{G}$, $VC(\mathcal{G})$, is defined to be the cardinality of the largest shattered set.

VC-dimension is often defined in terms of set-theoretic notation. One can view a function
$g \in \mathcal{G}$ as an indicator function for a set $X_g \subseteq X$ where $X_g = \{x \in X : g(x) = 1\}$. For any set
$S \subseteq X$, let $\Pi_{\mathcal{G}}(S) = \{S \cap X_g : g \in \mathcal{G}\}$. One can view $\Pi_{\mathcal{G}}(S)$ as the set of subsets of $S$ "picked
out" by functions in $\mathcal{G}$. Note that if $\Pi_{\mathcal{G}}(S) = 2^S$, then $S$ is shattered by $\mathcal{G}$. For any integer
$m \geq 1$, let $\Pi_{\mathcal{G}}(m) = \max\{|\Pi_{\mathcal{G}}(S)| : S \subseteq X, |S| = m\}$. One can view $\Pi_{\mathcal{G}}(m)$ as the maximum
number of subsets of any set of size $m$ "picked out" by functions in $\mathcal{G}$. Note that if $\Pi_{\mathcal{G}}(m) = 2^m$,
then there exists a set of size $m$ shattered by $\mathcal{G}$. One may define VC-dimension in terms of
$\Pi_{\mathcal{G}}(m)$ as follows: $VC(\mathcal{G}) = \max\{m : \Pi_{\mathcal{G}}(m) = 2^m\}$.
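
These definitions can be made concrete on a small finite domain. The sketch below computes the growth function and the VC-dimension by brute force; the function names and the example class of intervals over `range(n)` are assumptions chosen for illustration, and the enumeration is exponential, so it is only suitable for tiny domains.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Set, Tuple

Function = Callable[[int], int]


def picked_out(functions: Sequence[Function], S: Tuple[int, ...]) -> Set[Tuple[int, ...]]:
    """The distinct labellings of S induced by the given functions."""
    return {tuple(g(x) for x in S) for g in functions}


def growth(functions: Sequence[Function], domain: Sequence[int], m: int) -> int:
    """Maximum number of labellings over all subsets of the domain of size m."""
    return max(len(picked_out(functions, S)) for S in combinations(domain, m))


def vc_dimension(functions: Sequence[Function], domain: Sequence[int]) -> int:
    """Largest m for which some set of size m is shattered (0 if none)."""
    d = 0
    for m in range(1, len(domain) + 1):
        if growth(functions, domain, m) == 2 ** m:
            d = m
    return d


def interval_class(n: int) -> List[Function]:
    """Indicator functions of all intervals [a, b] within range(n), plus the empty set."""
    fs: List[Function] = [lambda x: 0]
    for a in range(n):
        for b in range(a, n):
            fs.append(lambda x, a=a, b=b: int(a <= x <= b))
    return fs
```

Intervals shatter any two points but cannot produce the labelling 1, 0, 1 on three ordered points, so the computed VC-dimension is 2 and the growth function at 3 is 7 rather than 8.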

We next prove a lemma concerning $\Pi_{\mathcal{G}}(m)$ which is used extensively in the sections that
follow.

Lemma 6 If $\mathcal{G} = \mathcal{G}_1 \cup \mathcal{G}_2$, then $\Pi_{\mathcal{G}}(m) \leq \Pi_{\mathcal{G}_1}(m) + \Pi_{\mathcal{G}_2}(m)$.

Proof: For any $m \geq 1$, let $S_m$ be a set of size $m$ such that $|\Pi_{\mathcal{G}}(S_m)| = \Pi_{\mathcal{G}}(m)$. Note that such a
set is guaranteed to exist by the definition of $\Pi_{\mathcal{G}}(m)$. We next note that $\Pi_{\mathcal{G}}(S_m) = \Pi_{\mathcal{G}_1}(S_m) \cup
\Pi_{\mathcal{G}_2}(S_m)$, and therefore $|\Pi_{\mathcal{G}}(S_m)| \leq |\Pi_{\mathcal{G}_1}(S_m)| + |\Pi_{\mathcal{G}_2}(S_m)|$. The proof is completed by noting
that $|\Pi_{\mathcal{G}_1}(S_m)| \leq \Pi_{\mathcal{G}_1}(m)$ and $|\Pi_{\mathcal{G}_2}(S_m)| \leq \Pi_{\mathcal{G}_2}(m)$. □

The growth of the $\Pi_{\mathcal{G}}(m)$ function plays an important role in proving a number of results
in PAC learning. Note that for any $m \leq VC(\mathcal{G})$, $\Pi_{\mathcal{G}}(m) = 2^m$. The following result due to
Sauer [27] upper bounds the growth of $\Pi_{\mathcal{G}}(m)$ for all $m \geq VC(\mathcal{G})$.
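
Lemma 7 (Sauer's Lemma) Let $\mathcal{G}$ be a set of $\{0,1\}$-valued functions, and let $d = VC(\mathcal{G})$.
For all integers $m \geq d$, $\Pi_{\mathcal{G}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}$.

Blumer et al. [7] have shown that for all integers $m \geq d \geq 1$, $\sum_{i=0}^{d} \binom{m}{i} < (em/d)^d$ where $e$
is the base of the natural logarithm. We present a new and simpler proof of this result below.

Lemma 8 For all integers $m \geq d \geq 1$, $\sum_{i=0}^{d} \binom{m}{i} < (em/d)^d$.

Proof: Since $0 < d/m \leq 1$, we have:

\[
\begin{aligned}
\left(\frac{d}{m}\right)^d \sum_{i=0}^{d} \binom{m}{i} &\leq \sum_{i=0}^{d} \left(\frac{d}{m}\right)^i \binom{m}{i} \\
&\leq \sum_{i=0}^{d} \left(\frac{d}{m}\right)^i \frac{m^i}{i!} \\
&= \sum_{i=0}^{d} \frac{d^i}{i!} \\
&< \sum_{i=0}^{\infty} \frac{d^i}{i!} \\
&= e^d
\end{aligned}
\]

Dividing both sides of this inequality by $(d/m)^d$ yields the desired result. □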
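
The bound of Lemma 8 is easy to check numerically for small values. The following sketch compares the binomial sum from Sauer's Lemma with $(em/d)^d$; the function names are hypothetical.

```python
from math import comb, e


def sauer_sum(m: int, d: int) -> int:
    """Sum of C(m, i) for i = 0..d, the bound in Sauer's Lemma."""
    return sum(comb(m, i) for i in range(d + 1))


def exponential_bound(m: int, d: int) -> float:
    """The quantity (e m / d)^d from Lemma 8."""
    return (e * m / d) ** d
```

At $m = d$ the sum equals $2^d$ and the bound equals $e^d$, which is the tightest case in relative terms; the gap widens as $m$ grows.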

We may now characterize the growth of the $\Pi_{\mathcal{G}}(m)$ function as follows: $\Pi_{\mathcal{G}}(m)$ grows
exponentially up to $m = VC(\mathcal{G})$, and $\Pi_{\mathcal{G}}(m)$ grows at most polynomially after that. We may use
this fact to obtain an upper bound on the VC-dimension of $\mathcal{G}$ in the following way. Suppose that
for some $m$, we could show that $\Pi_{\mathcal{G}}(m) < 2^m$. Then $m$ must be larger than the VC-dimension
of $\mathcal{G}$.

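A.2.2 VC-Dimension of $\mathcal{Q}' = \mathcal{Q} \cup \overline{\mathcal{Q}}$

We now prove a result used in Section 3.3 concerning the VC-dimension of the query space used
by our simulation of an SQ algorithm in the classification noise model. Recall that $X$ is our
instance space, and $Y = X \times \{0,1\}$ is our labelled example space. For any labelled example
$y = \langle x, l \rangle$, we define $\bar{y} = \langle x, \bar{l} \rangle$, and for any query $\chi$, we define $\bar{\chi}(y) = \chi(\bar{y})$. Finally, for any set
of queries $\mathcal{Q}$, we define $\overline{\mathcal{Q}} = \{\bar{\chi} : \chi \in \mathcal{Q}\}$.

If $\mathcal{Q}$ is the query space of an SQ algorithm, then $\mathcal{Q}' = \mathcal{Q} \cup \overline{\mathcal{Q}}$ is the query space of our
simulation of this SQ algorithm. We may bound the VC-dimension of $\mathcal{Q}'$ as follows.

Lemma 9 If $\mathcal{Q}' = \mathcal{Q} \cup \overline{\mathcal{Q}}$, then $VC(\mathcal{Q}') \leq c \cdot VC(\mathcal{Q})$ for a constant $c \approx 4.66438$.

Proof: We first claim that $VC(\overline{\mathcal{Q}}) = VC(\mathcal{Q})$. This fact can be shown as follows. For any
$\chi \in \mathcal{Q}$, we have that $\bar{\chi}(y) = \chi(\bar{y})$. Note that if $\chi \in \mathcal{Q}$, then $\bar{\chi} \in \overline{\mathcal{Q}}$. For any countable set
$T = \{y_1, \ldots, y_m\}$, the labelling of $T$ induced by $\chi$ is identical to the labelling of $\bar{T}$ induced
by $\bar{\chi}$ where $\bar{T} = \{\bar{y}_1, \ldots, \bar{y}_m\}$. Therefore, if there exists a set of size $m$ shattered by $\mathcal{Q}$, then
there exists a set of size $m$ shattered by $\overline{\mathcal{Q}}$. This implies that $VC(\overline{\mathcal{Q}}) \geq VC(\mathcal{Q})$. The fact that
$VC(\mathcal{Q}) \geq VC(\overline{\mathcal{Q}})$ is shown similarly, and thus $VC(\mathcal{Q}) = VC(\overline{\mathcal{Q}})$.

Let $d = VC(\mathcal{Q}) = VC(\overline{\mathcal{Q}})$. For any $m \geq d$, we have both $\Pi_{\mathcal{Q}}(m) < (em/d)^d$ and $\Pi_{\overline{\mathcal{Q}}}(m) <
(em/d)^d$. Thus, for any $m \geq d$, we have $\Pi_{\mathcal{Q}'}(m) \leq \Pi_{\mathcal{Q}}(m) + \Pi_{\overline{\mathcal{Q}}}(m) < 2(em/d)^d$.

If $\Pi_{\mathcal{Q}'}(m) < 2^m$ for some $m$, then $m > VC(\mathcal{Q}')$. Thus, any $m \geq d$ which satisfies

\[ 2(em/d)^d \leq 2^m \]

is an upper bound on $VC(\mathcal{Q}')$. Setting $m = c \cdot d$ and solving for $c$, we obtain:

\[
\begin{aligned}
& 2(ecd/d)^d \leq 2^{cd} \\
\iff\ & 2(ec)^d \leq (2^c)^d \\
\Longleftarrow\ & (2ec)^d \leq (2^c)^d \\
\iff\ & 2ec \leq 2^c \\
\Longleftarrow\ & c \geq 4.66438
\end{aligned}
\]

Thus, $VC(\mathcal{Q}') \leq c \cdot VC(\mathcal{Q})$ for a constant $c \approx 4.66438$. □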
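
The constant in Lemma 9 is the larger root of $2ec = 2^c$, which can be recovered numerically. The sketch below uses plain bisection; the function name, the bracketing interval $[4, 5]$ and the tolerance are assumptions made for the example.

```python
import math


def boundary_constant(tol: float = 1e-9) -> float:
    """Larger root of 2^c = 2 e c, found by bisection on [4, 5]."""
    def f(c: float) -> float:
        return 2.0 ** c - 2.0 * math.e * c

    lo, hi = 4.0, 5.0  # f(4) < 0 < f(5), so a root lies in between
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi
```

Rounding $c \cdot d$ up to an integer $m$ then gives a sample-set size for which $2(em/d)^d \leq 2^m$, as required in the proof.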

A.2.3 The VC-Dimension of the Query Space of Boosting

We now prove a result used in Section 3.1 concerning the VC-dimension of the query space of
boosting. Let $\mathcal{Q}_0$ and $\mathcal{H}_0$ be the query space and hypothesis class used by a weak SQ learning
algorithm. The queries used by the strong SQ learning algorithm obtained by either Scheme 1
or Scheme 2 boosting are of the form $\chi$, $\chi_j^i$ and $\chi \wedge \chi_j^i$ where $\chi \in \mathcal{Q}_0$ and $\chi_j^i$ is constructed from
hypotheses in $\mathcal{H}_0$.

A particular query $\chi_j^i$ is defined by $i$ hypotheses and an integer $j$, $0 \leq j \leq i$. $\chi_j^i(x, l)$ is 1
if exactly $j$ of the $i$ hypotheses map $x$ to $l$, and $\chi_j^i(x, l)$ is 0 otherwise. Note that $i$ is bounded
by ki = j^ln ~ t in Scheme 1 boostingrand i is bounded by k 2 = -^-ln ~ t in Scheme 2 boosting. 
Also note that the hypotheses used to construct a particular $\chi_j^i$ need not be distinct.

For fixed $i$ and $j$, let $\Gamma_j^i$ be the set of all $\chi_j^i$ queries. In addition, we make the following two
definitions:

\[ \Gamma^i = \bigcup_{j=0}^{i} \Gamma_j^i \]
\[ \Gamma^{[k]} = \bigcup_{i=1}^{k} \Gamma^i \]

For any two sets of $\{0,1\}$-valued functions $A$ and $B$, we define

\[ A \wedge B = \{f_a \wedge f_b : f_a \in A, f_b \in B\}. \]

The query space of boosting, $\mathcal{Q}_B$, may then be given as follows:

\[ \mathcal{Q}_B = \mathcal{Q}_0 \cup \Gamma^{[k]} \cup \mathcal{Q}_0 \wedge \Gamma^{[k]} \]

Note that $k = k_1$ in the case of Scheme 1 boosting, and $k = k_2$ in the case of Scheme 2 boosting.

We may bound the VC-dimension of $\mathcal{Q}_B$ in terms of the VC-dimensions of $\mathcal{Q}_0$ and $\mathcal{H}_0$ in a
manner similar to that used in the previous section. In particular, we bound $\Pi_{\mathcal{Q}_0}(m)$, $\Pi_{\Gamma^{[k]}}(m)$
and $\Pi_{\mathcal{Q}_0 \wedge \Gamma^{[k]}}(m)$. By applying Lemma 6, we obtain a bound on $\Pi_{\mathcal{Q}_B}(m)$. From this bound, we
obtain a bound on the VC-dimension of $\mathcal{Q}_B$. We begin by examining $\Gamma_j^i$.

For any hypothesis $h : X \rightarrow \{0,1\}$, we define $\bar{h} : X \times \{0,1\} \rightarrow \{0,1\}$ as follows

\[ \bar{h}(x, l) = (h(x) \equiv l) \]

where $\equiv$ is the binary equivalence operator. Thus, $\bar{h}(x, l)$ is true if and only if the hypothesis
$h$ maps $x$ to $l$. Let $\overline{\mathcal{H}}_0 = \{\bar{h} : h \in \mathcal{H}_0\}$. We may now define a query $\chi_j^i \in \Gamma_j^i$ as follows. Let
$h_1, \ldots, h_i$ be the $i$ hypotheses used to construct $\chi_j^i$.

\[
\chi_j^i(x, l) =
\begin{cases}
1 & \text{if exactly } j \text{ of } \bar{h}_1(x, l), \ldots, \bar{h}_i(x, l) \text{ are } 1 \\
0 & \text{otherwise}
\end{cases}
\]

From a set-theoretic perspective, we can view $\chi_j^i$ and $\bar{h}$ as indicator functions for subsets of
$Y = X \times \{0,1\}$. We then have the following:

\[ Y_{\chi_j^i} = \{y \in Y : y \text{ is an element of exactly } j \text{ of the sets } Y_{\bar{h}_1}, \ldots, Y_{\bar{h}_i}\} \]

We next relate $\Pi_{\Gamma_j^i}(m)$ with $\Pi_{\overline{\mathcal{H}}_0}(m)$ as follows.

Claim 5 $\Pi_{\Gamma_j^i}(m) \leq \binom{\Pi_{\overline{\mathcal{H}}_0}(m) + i - 1}{i}$.

Proof: Consider a particular $\chi_j^i$. We can view $\chi_j^i$ as either a mapping from $Y$ to $\{0,1\}$ or as an
indicator function for a set $Y_{\chi_j^i} \subseteq Y$. In the discussion that follows, it will be more convenient
to view $\chi_j^i$ as an indicator function.

Let $T$ be any subset of $Y$ of size $m$. $\Pi_{\overline{\mathcal{H}}_0}(T)$ is the set of subsets of $T$ picked out by
functions $\bar{h} \in \overline{\mathcal{H}}_0$, and $\Pi_{\Gamma_j^i}(T)$ is the set of subsets of $T$ picked out by functions $\chi_j^i$ in $\Gamma_j^i$. By the
definition of $\chi_j^i$, note that each unique set in $\Pi_{\Gamma_j^i}(T)$ must correspond to a unique collection of $i$
sets in $\Pi_{\overline{\mathcal{H}}_0}(T)$. However, the $i$ sets in each unique collection need not be distinct. Therefore,
the number of unique collections is given by the number of arrangements of $i$ indistinguishable
balls in $|\Pi_{\overline{\mathcal{H}}_0}(T)|$ bins. We thus have

\[ |\Pi_{\Gamma_j^i}(T)| \leq \binom{|\Pi_{\overline{\mathcal{H}}_0}(T)| + i - 1}{i} \]

which implies the desired result. □

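By Lemma 6 and the definition of $\Gamma^i$, we now have

\[ \Pi_{\Gamma^i}(m) \leq (i+1) \binom{\Pi_{\overline{\mathcal{H}}_0}(m) + i - 1}{i}. \]

Furthermore, by Lemma 6 and the definition of $\Gamma^{[k]}$, we have

\[ \Pi_{\Gamma^{[k]}}(m) \leq \sum_{i=1}^{k} (i+1) \binom{\Pi_{\overline{\mathcal{H}}_0}(m) + i - 1}{i}. \]

By applying the well known fact that $\Pi_{A \wedge B}(m) \leq \Pi_A(m) \cdot \Pi_B(m)$ [4, pg. 104], we now have

\[ \Pi_{\mathcal{Q}_0 \wedge \Gamma^{[k]}}(m) \leq \Pi_{\mathcal{Q}_0}(m) \sum_{i=1}^{k} (i+1) \binom{\Pi_{\overline{\mathcal{H}}_0}(m) + i - 1}{i}. \]

Finally, by Lemma 6 and the definition of $\mathcal{Q}_B$, we have

\[ \Pi_{\mathcal{Q}_B}(m) \leq \Pi_{\mathcal{Q}_0}(m) + \sum_{i=1}^{k} (i+1) \binom{\Pi_{\overline{\mathcal{H}}_0}(m) + i - 1}{i} + \Pi_{\mathcal{Q}_0}(m) \sum_{i=1}^{k} (i+1) \binom{\Pi_{\overline{\mathcal{H}}_0}(m) + i - 1}{i} \qquad \text{(A.4)} \]

Note that Equation A.4 is of the same form as Equation A.1. We can therefore simplify
Equation A.4 in a similar manner to obtain:

\[ \Pi_{\mathcal{Q}_B}(m) \leq 2 (\Pi_{\mathcal{Q}_0}(m) + 1)(\Pi_{\overline{\mathcal{H}}_0}(m) + 2)^k \qquad \text{(A.5)} \]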

In order to bound the VC-dimension of $\mathcal{Q}_B$, we must relate the VC-dimension of $\hat{\mathcal{H}}_0$ with the VC-dimension of $\mathcal{H}_0$.

Claim 6 $VC(\hat{\mathcal{H}}_0) = VC(\mathcal{H}_0)$

Proof: We begin by noting that for any instance $x \in X$, $h(x) = 1$ if and only if $\hat{h}(x, 1) = 1$. For any countable set $S = \{x_1, \ldots, x_m\}$ and hypothesis $h \in \mathcal{H}_0$, the labelling of $S$ induced by $h$ is identical to the labelling of $T$ induced by $\hat{h}$, where $T = \{(x_1, 1), \ldots, (x_m, 1)\}$. Thus, if there exists a set of size $m$ shattered by $\mathcal{H}_0$, then there exists a set of size $m$ shattered by $\hat{\mathcal{H}}_0$. This implies that $VC(\hat{\mathcal{H}}_0) \ge VC(\mathcal{H}_0)$.

We next note that for all functions $\hat{h} \in \hat{\mathcal{H}}_0$, $\hat{h}(x, l) = \neg\hat{h}(x, \neg l)$. Now let $T = \{(x_1, l_1), \ldots, (x_m, l_m)\}$ be any countable set shattered by $\hat{\mathcal{H}}_0$. If $(x, l) \in T$, then $(x, \neg l) \notin T$ since $(x, l)$ and $(x, \neg l)$ cannot be labelled identically, which is required for shattering. Thus, $S = \{x_1, \ldots, x_m\}$ is of size $m$.

Now note that $h(x) = b$ if and only if $\hat{h}(x, l) = (b \equiv l)$. Consider any labelling $(b_1, \ldots, b_m)$ of $S$. This labelling of $S$ would be induced by the hypothesis $h \in \mathcal{H}_0$ corresponding to the function $\hat{h} \in \hat{\mathcal{H}}_0$ which labels $T$ as follows: $((b_1 \equiv l_1), \ldots, (b_m \equiv l_m))$. Since $T$ is shattered by $\hat{\mathcal{H}}_0$, such a function and corresponding hypothesis must exist. Thus, if there exists a set of size $m$ shattered by $\hat{\mathcal{H}}_0$, then there exists a set of size $m$ shattered by $\mathcal{H}_0$. This implies that $VC(\mathcal{H}_0) \ge VC(\hat{\mathcal{H}}_0)$. □
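Claim 6 can be sanity-checked on a small finite example. The sketch below assumes a hypothetical hypothesis class (indicator functions of intervals on six points, chosen only for illustration) and computes both VC-dimensions by exhaustive search; none of the names come from the thesis.

```python
from itertools import combinations


def vc_dimension(points, functions):
    """Brute-force VC-dimension of a finite class over a finite domain."""
    best = 0
    for size in range(1, len(points) + 1):
        # A set is shattered when all 2^size labellings are induced on it.
        shattered = any(
            len({tuple(f(p) for p in subset) for f in functions}) == 2 ** size
            for subset in combinations(points, size)
        )
        if not shattered:
            break  # no set of this size is shattered, so no larger set is either
        best = size
    return best


X = list(range(6))
# Hypothetical example class: indicators of the intervals [a, b) on X.
H = [lambda x, a=a, b=b: int(a <= x < b) for a in range(7) for b in range(a, 7)]
# h_hat(x, l) is 1 exactly when h(x) equals the label l.
H_hat = [lambda y, h=h: int(h(y[0]) == y[1]) for h in H]
Y = [(x, l) for x in X for l in (0, 1)]

if __name__ == "__main__":
    print(vc_dimension(X, H), vc_dimension(Y, H_hat))
```

Both values agree for this class, as the claim predicts; the search is exponential and is meant only for toy domains.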

We are now in a position to prove the main result of this section.

Lemma 10 $VC(\mathcal{Q}_B) = O(VC(\mathcal{Q}_0) + VC(\mathcal{H}_0) \cdot k \log k)$

Proof: In order to bound the VC-dimension of $\mathcal{Q}_B$, we need only find an $m$ which satisfies $\Pi_{\mathcal{Q}_B}(m) < 2^m$. We begin by further simplifying the expression for $\Pi_{\mathcal{Q}_B}(m)$.

Assume that $\Pi_{\mathcal{Q}_0}(m) \ge 1$ and $\Pi_{\hat{\mathcal{H}}_0}(m) \ge 2$. Each of these assumptions is assured when $m \ge 1$ and the VC-dimensions of $\mathcal{Q}_0$ and $\hat{\mathcal{H}}_0$ are at least 1. We then have the following:

$$\begin{aligned} \Pi_{\mathcal{Q}_B}(m) &\le 2(\Pi_{\mathcal{Q}_0}(m) + 1)(\Pi_{\hat{\mathcal{H}}_0}(m) + 2)^k \\ &\le 2(2\Pi_{\mathcal{Q}_0}(m))(2\Pi_{\hat{\mathcal{H}}_0}(m))^k \\ &= 2^{k+2}\, \Pi_{\mathcal{Q}_0}(m)\, (\Pi_{\hat{\mathcal{H}}_0}(m))^k \end{aligned}$$

Let $q_0 = VC(\mathcal{Q}_0)$ and let $d_0 = VC(\mathcal{H}_0) = VC(\hat{\mathcal{H}}_0)$. For any $m \ge \max\{q_0, d_0\}$, we have both $\Pi_{\mathcal{Q}_0}(m) \le (em/q_0)^{q_0}$ and $\Pi_{\hat{\mathcal{H}}_0}(m) \le (em/d_0)^{d_0}$. We now have:

$$\Pi_{\mathcal{Q}_B}(m) \le 2^{k+2} (em/q_0)^{q_0} (em/d_0)^{d_0 k}$$

To bound the VC-dimension of $\mathcal{Q}_B$, we need only find an $m$ which guarantees that the right-hand side of the above inequality is at most $2^m$.

$$\begin{aligned} & 2^{k+2} (em/q_0)^{q_0} (em/d_0)^{d_0 k} \le 2^m \\ \Longleftrightarrow\ & (k + 2) + q_0 \lg(em/q_0) + d_0 k \lg(em/d_0) \le m \qquad (A.6) \end{aligned}$$

For fixed $d_0$, $q_0$ and $k$, the above inequality has the form $m \ge g_1(m) + g_2(m) + g_3(m)$ where each function $g_i(m)$ "grows" more slowly than $m$. In particular, each function $g_i$ satisfies the following property (recall that we are restricted to values $m \ge \max\{q_0, d_0\}$): if $m_i \ge g_i(3m_i)$, then $m \ge g_i(3m)$ for all $m \ge m_i$. Our strategy is as follows. Find appropriate values of $m_i$ which satisfy $m_i \ge g_i(3m_i)$, and let $m = 3\max\{m_1, m_2, m_3\}$. Then $m$ must satisfy $m \ge g_1(m) + g_2(m) + g_3(m)$. The reasoning is as follows. Suppose, without loss of generality, that $m_1 = \max\{m_1, m_2, m_3\}$. We then have $m = 3m_1$. Furthermore, $m_1 \ge g_1(3m_1)$, and since $m_1 \ge m_2$ and $m_1 \ge m_3$, we also have $m_1 \ge g_2(3m_1)$ and $m_1 \ge g_3(3m_1)$. Combining these inequalities, we have $3m_1 \ge g_1(3m_1) + g_2(3m_1) + g_3(3m_1)$ which implies the desired result.

For $g_1(m) = k + 2$, we may simply choose $m_1 = k + 2$. For $g_2(m) = q_0 \lg(em/q_0)$, we choose $m_2 = 6q_0$, which is verified as follows:

$$\begin{aligned} & 6q_0 > q_0 \lg(e(3 \cdot 6q_0)/q_0) \\ \Longleftrightarrow\ & 6 > \lg(18e) \approx 5.613 \end{aligned}$$

For $g_3(m) = d_0 k \lg(em/d_0)$, we choose $m_3 = 9 d_0 k \lg k$, which is verified as follows:

$$\begin{aligned} & 9 d_0 k \lg k > d_0 k \lg(e(3 \cdot 9 d_0 k \lg k)/d_0) \\ \Longleftrightarrow\ & 9 \lg k > \lg(27 e k \lg k) \\ \Longleftrightarrow\ & k^9 > 27 e k \lg k \\ \Longleftarrow\ & k^7 > 27e \end{aligned}$$

This final inequality is true for any $k \ge 2$. We have effectively shown the following, which completes the proof: $m = 3\max\{k + 2,\ 6q_0,\ 9 d_0 k \lg k\} = O(q_0 + d_0 k \log k)$. □
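The choice of $m$ at the end of this proof can be spot-checked numerically. The sketch below evaluates the left-hand side of Inequality A.6 at $m = 3\max\{k+2, 6q_0, 9d_0 k \lg k\}$ for a few small parameter values; the function names and the parameter ranges are illustrative assumptions, and a numeric check is of course no substitute for the argument above.

```python
import math


def lemma10_m(q0, d0, k):
    """The value m = 3 * max{k + 2, 6 q0, 9 d0 k lg k} from the proof (k >= 2)."""
    return 3 * max(k + 2, 6 * q0, 9 * d0 * k * math.log2(k))


def a6_left_side(m, q0, d0, k):
    """Left-hand side of Inequality A.6."""
    return (
        (k + 2)
        + q0 * math.log2(math.e * m / q0)
        + d0 * k * math.log2(math.e * m / d0)
    )


if __name__ == "__main__":
    for q0, d0, k in [(1, 1, 2), (5, 1, 2), (1, 5, 10), (3, 4, 7)]:
        m = lemma10_m(q0, d0, k)
        print(q0, d0, k, round(a6_left_side(m, q0, d0, k), 2), "<=", round(m, 2))
```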

A.2.4 The VC-Dimension of the Query Space of Hybrid Boosting

We now prove a result used in Section 3.1 concerning the VC-dimension of the query space of hybrid boosting. As in Section A.1.2, let $\mathcal{Q}_{HB}$ be the query space of the hybrid booster, and let $\mathcal{Q}_{1/4}$ and $\mathcal{H}_{1/4}$ be the query space and hypothesis class of the Scheme 1 booster. Furthermore, let $k_1 = k_1(\gamma, 1/4)$ and $k_2 = k_2(1/4, \epsilon)$. We then have the following analogs of Equation A.5:

$$\begin{aligned} \Pi_{\mathcal{Q}_{HB}}(m) &\le 2(\Pi_{\mathcal{Q}_{1/4}}(m) + 1)(\Pi_{\hat{\mathcal{H}}_{1/4}}(m) + 2)^{k_2} \\ \Pi_{\mathcal{Q}_{1/4}}(m) &\le 2(\Pi_{\mathcal{Q}_0}(m) + 1)(\Pi_{\hat{\mathcal{H}}_0}(m) + 2)^{k_1} \end{aligned}$$

Now, $\mathcal{H}_{1/4}$ is the set of hypotheses which are majority functions of up to $k_1$ hypotheses from $\mathcal{H}_0$, and $\hat{\mathcal{H}}_{1/4} = \{\hat{h} : h \in \mathcal{H}_{1/4}\}$. However, one can also define each function $\hat{h} \in \hat{\mathcal{H}}_{1/4}$ as follows. Given a function $\hat{h} \in \hat{\mathcal{H}}_{1/4}$, $\hat{h}$ corresponds to some hypothesis $h \in \mathcal{H}_{1/4}$ which in turn corresponds to some set of hypotheses $\{h_1, \ldots, h_j\}$ from $\mathcal{H}_0$ where $j \le k_1$. By definition, we have

$$\hat{h}(x, l) = (\mathrm{maj}\{h_1(x), \ldots, h_j(x)\} \equiv l).$$

However, it is also the case that

$$(\mathrm{maj}\{h_1(x), \ldots, h_j(x)\} \equiv l) = \mathrm{maj}\{\hat{h}_1(x, l), \ldots, \hat{h}_j(x, l)\}.$$

Thus, we can think of $\hat{\mathcal{H}}_{1/4}$ as the set of majority functions of up to $k_1$ functions from $\hat{\mathcal{H}}_0$.
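The identity relating the majority of hypotheses to the majority of their "hatted" versions can be verified exhaustively for small $j$. The sketch below assumes $j$ is odd, so that the majority is never tied; the text above does not specify a tie-breaking rule, and with ties broken toward 0 the identity fails for even $j$. The function names are illustrative.

```python
from itertools import product


def maj(bits):
    """Strict majority of a tuple of 0/1 values (ties count as 0)."""
    return int(2 * sum(bits) > len(bits))


def identity_holds(j):
    """Check (maj{h_1..h_j} == l) == maj{h_hat_1..h_hat_j} over all inputs."""
    for values in product((0, 1), repeat=j):
        for label in (0, 1):
            lhs = int(maj(values) == label)
            rhs = maj(tuple(int(v == label) for v in values))
            if lhs != rhs:
                return False
    return True


if __name__ == "__main__":
    for j in (1, 2, 3, 5, 7):
        print(j, identity_holds(j))
```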

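Now, let $\hat{\mathcal{H}}_{1/4}^{j}$ be the set of majority functions of $j$ functions from $\hat{\mathcal{H}}_0$, and let $T$ be any subset of $Y$ of size $m$. $\Pi_{\hat{\mathcal{H}}_0}(T)$ is the set of subsets of $T$ picked out by functions $\hat{h} \in \hat{\mathcal{H}}_0$, and $\Pi_{\hat{\mathcal{H}}_{1/4}^{j}}(T)$ is the set of subsets of $T$ picked out by functions $\hat{h} \in \hat{\mathcal{H}}_{1/4}^{j}$. Note that each unique set in $\Pi_{\hat{\mathcal{H}}_{1/4}^{j}}(T)$ must correspond to a unique collection of $j$ sets in $\Pi_{\hat{\mathcal{H}}_0}(T)$. Since the $j$ sets in each unique collection need not be distinct, the number of such unique collections is given by the number of arrangements of $j$ indistinguishable balls in $|\Pi_{\hat{\mathcal{H}}_0}(T)|$ bins. Thus, we have $|\Pi_{\hat{\mathcal{H}}_{1/4}^{j}}(T)| \le \binom{|\Pi_{\hat{\mathcal{H}}_0}(T)| + j - 1}{j}$ which implies that

$$\Pi_{\hat{\mathcal{H}}_{1/4}^{j}}(m) \le \binom{\Pi_{\hat{\mathcal{H}}_0}(m) + j - 1}{j}.$$

Since $\hat{\mathcal{H}}_{1/4} = \bigcup_{j=1}^{k_1} \hat{\mathcal{H}}_{1/4}^{j}$, by Lemma 6, we have:

$$\begin{aligned} \Pi_{\hat{\mathcal{H}}_{1/4}}(m) &\le \sum_{j=1}^{k_1} \binom{\Pi_{\hat{\mathcal{H}}_0}(m) + j - 1}{j} \\ &< \sum_{j=0}^{k_1} \binom{\Pi_{\hat{\mathcal{H}}_0}(m) + j - 1}{j} \\ &= \binom{\Pi_{\hat{\mathcal{H}}_0}(m) + k_1}{k_1} \\ &\le (\Pi_{\hat{\mathcal{H}}_0}(m) + 1)^{k_1} \end{aligned}$$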
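The last two steps of this chain rest on the identity $\sum_{j=0}^{k} \binom{p + j - 1}{j} = \binom{p + k}{k}$ and on the bound $\binom{p + k}{k} \le (p + 1)^k$. A minimal numeric sketch follows, with $p$ standing in for $\Pi_{\hat{\mathcal{H}}_0}(m)$; the function name is illustrative.

```python
from math import comb


def union_bound_sum(p, k):
    """Sum_{j=0}^{k} C(p + j - 1, j) for integers p >= 1, k >= 0."""
    return sum(comb(p + j - 1, j) for j in range(k + 1))


if __name__ == "__main__":
    for p in (1, 2, 5):
        for k in (1, 3, 6):
            print(p, k, union_bound_sum(p, k), comb(p + k, k), (p + 1) ** k)
```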



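Since the sum over $j$ starts at $j = 1$, the bound above is in fact $\Pi_{\hat{\mathcal{H}}_{1/4}}(m) \le (\Pi_{\hat{\mathcal{H}}_0}(m) + 1)^{k_1} - 1$. Combining the above results, we have the following:

$$\Pi_{\mathcal{Q}_{HB}}(m) \le 2\left(2(\Pi_{\mathcal{Q}_0}(m) + 1)(\Pi_{\hat{\mathcal{H}}_0}(m) + 2)^{k_1} + 1\right)\left((\Pi_{\hat{\mathcal{H}}_0}(m) + 1)^{k_1} + 1\right)^{k_2}$$

Now, assume that $\Pi_{\mathcal{Q}_0}(m) \ge 1$ and $\Pi_{\hat{\mathcal{H}}_0}(m) \ge 2$. Each of these assumptions is assured when $m \ge 1$ and the VC-dimensions of $\mathcal{Q}_0$ and $\hat{\mathcal{H}}_0$ are at least 1. We then have the following:

$$\begin{aligned} \Pi_{\mathcal{Q}_{HB}}(m) &\le 2\left(2(\Pi_{\mathcal{Q}_0}(m) + 1)(\Pi_{\hat{\mathcal{H}}_0}(m) + 2)^{k_1} + 1\right)\left((\Pi_{\hat{\mathcal{H}}_0}(m) + 1)^{k_1} + 1\right)^{k_2} \\ &\le 2\left(2(2\Pi_{\mathcal{Q}_0}(m))(2\Pi_{\hat{\mathcal{H}}_0}(m))^{k_1} + 1\right)\left((2\Pi_{\hat{\mathcal{H}}_0}(m))^{k_1} + 1\right)^{k_2} \\ &\le 2\left(2^{k_1+3}\, \Pi_{\mathcal{Q}_0}(m) (\Pi_{\hat{\mathcal{H}}_0}(m))^{k_1}\right)\left(2^{k_1+1} (\Pi_{\hat{\mathcal{H}}_0}(m))^{k_1}\right)^{k_2} \\ &= 2^{(k_1+1)(k_2+1)+3}\, \Pi_{\mathcal{Q}_0}(m)\, (\Pi_{\hat{\mathcal{H}}_0}(m))^{k_1(k_2+1)} \end{aligned}$$

Let $q_0 = VC(\mathcal{Q}_0)$ and let $d_0 = VC(\mathcal{H}_0) = VC(\hat{\mathcal{H}}_0)$. For any $m \ge \max\{q_0, d_0\}$, we have both $\Pi_{\mathcal{Q}_0}(m) \le (em/q_0)^{q_0}$ and $\Pi_{\hat{\mathcal{H}}_0}(m) \le (em/d_0)^{d_0}$. We now have:

$$\Pi_{\mathcal{Q}_{HB}}(m) \le 2^{(k_1+1)(k_2+1)+3} (em/q_0)^{q_0} (em/d_0)^{d_0 k_1 (k_2+1)}$$

To bound the VC-dimension of $\mathcal{Q}_{HB}$, we need only find an $m$ which guarantees that the right-hand side of the above inequality is at most $2^m$.

$$\begin{aligned} & 2^{(k_1+1)(k_2+1)+3} (em/q_0)^{q_0} (em/d_0)^{d_0 k_1 (k_2+1)} \le 2^m \\ \Longleftarrow\ & ((k_1 + 1)(k_2 + 1) + 3) + q_0 \lg(em/q_0) + d_0 k_1 (k_2 + 1) \lg(em/d_0) \le m \qquad (A.7) \end{aligned}$$

Inequality A.7 has the same form as Inequality A.6. By appropriate substitution, we find that $m = 3\max\{(k_1 + 1)(k_2 + 1) + 3,\ 6q_0,\ 9 d_0 k_1 (k_2 + 1) \lg(k_1 (k_2 + 1))\} = O(q_0 + d_0 k_1 k_2 \log(k_1 k_2))$ is sufficient to satisfy the above inequality. We have effectively proven the following:

Lemma 11 $VC(\mathcal{Q}_{HB}) = O(VC(\mathcal{Q}_0) + VC(\mathcal{H}_0) \cdot k_1 k_2 \log(k_1 k_2))$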

A.3 A Lower Bound for Probabilistic SQ Algorithms

Throughout this thesis, we have assumed that SQ algorithms are deterministic and output an accurate hypothesis with certainty. In this section of the appendix, we relax this condition by allowing probabilistic SQ algorithms which output accurate hypotheses with high probability. In particular, we show a lower bound on the query and tolerance complexities of such SQ algorithms which is analogous to the result obtained in Section 3.2.2.

Consider the two player game described in Section 3.2.2. We modify this game by allowing the player to be probabilistic, and we only require that the player output an acceptable set with probability at least $1 - \delta$, for some $\delta > 0$. We may now show the following.

Lemma 12 For any d > 4, t < d/4, 8 < 1/8 and N = £l(d 1+a ) for some a > 0, the probabilistic 
player requires O( lo ( °^ o N + ) queries of the oracle. 

Proof: By incorporating techniques from Kearns' lower bound proof [19], we can modify the original proof of Lemma 1 as follows. The adversary chooses the target set $S$ randomly and uniformly from the set $\mathcal{S}_0$ of all $\binom{N}{d}$ subsets of $[N]$ of size $d$. Consider the first query, $Q_1$, submitted by the player. $Q_1$ partitions $\mathcal{S}_0$ into $d + 1$ sets $\mathcal{S}_1^0, \mathcal{S}_1^1, \ldots, \mathcal{S}_1^d$ where each subset $S \in \mathcal{S}_1^i$ has $|S \cap Q_1| = i$. Since the choice of the target set was random, uniform and independent of $Q_1$, the probability that the target set is an element of $\mathcal{S}_1^i$ is proportional to $|\mathcal{S}_1^i|$. Note that $\mathcal{S}_1$, by definition, is the set $\mathcal{S}_1^i$ of which the target is a member.

For any $k \ge 2$, consider all $\mathcal{S}_1^i$ for which $|\mathcal{S}_1^i| < |\mathcal{S}_0| / (k \cdot (d + 1))$. Since there are only $d + 1$ sets, the total cardinality of such "small" sets is less than $|\mathcal{S}_0| / k$. Thus, we have the following:

$$\Pr\left[\, |\mathcal{S}_1| \ge \frac{|\mathcal{S}_0|}{k \cdot (d + 1)} \,\right] \ge 1 - \frac{1}{k}.$$

By successively applying this result through $k$ queries, we obtain the following:

$$\Pr\left[\, |\mathcal{S}_k| \ge \frac{|\mathcal{S}_0|}{(k \cdot (d + 1))^k} \,\right] \ge \left(1 - \frac{1}{k}\right)^k.$$

Note that for any $k \ge 2$, $(1 - 1/k)^k \in [1/4, 1/e)$. Thus, with probability at least 1/4, we have a lower bound on the size of $|\mathcal{S}_k|$. We next show that if $|\mathcal{S}_k|$ is sufficiently large, then there is a significant probability that the player will fail if it halts and outputs a set at this point.

Let $T$ be any set output by the player at the end of the game. For any $i$, $0 \le i \le N$, note that there are exactly $\binom{N}{i}$ sets $S \in 2^{[N]}$ such that $|S \,\triangle\, T| = i$. Thus, there are exactly $\sum_{i=0}^{t} \binom{N}{i}$ sets $S \in 2^{[N]}$ such that $|S \,\triangle\, T| \le t$. Now suppose that $|\mathcal{S}_k| \ge 2 \sum_{i=0}^{t} \binom{N}{i}$. Since the target set is equally likely to be any element of $\mathcal{S}_k$, the probability that $T$ is an acceptable set is at most 1/2. Thus, the player will fail with probability at least 1/8 if it halts after $k$ questions for any $k$ which satisfies the following inequality:

nV (n 



1} 


< 2 


c. 


d/4j) 


< 2 


feN\ d/4 
\d/4j 



~ (k-(d+l)) k ~ (k-(d+l)) k ~ '^' 
Solving the third inequality for (k ■ (d + l)) k and noting that d > 4Twe have the following: 

d/4 



O klogk + klog(d+l) < — l °g(j^) 



The latter inequality is implied by the following two inequalities: 

k log k < 

<= k < 



*log(d+l) < T log(^) 
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Each of these inequalities is implied by 
for $N = \Omega(d^{1+\alpha})$. □

Combining this result with the proof of Theorem 5rwe immediately obtain the following: 

Theorem 21 There exists a parameterized family of function classes which require 0( ,°^ J^} €)) / 
queries with additive error 0(e) to learn in the SQ model by a probabilistic algorithm. 
Searching in the Presence of Errors

Chapter 7

Introduction

Coping with errors during computation has been a subject of long-standing interest. It has motivated research in such areas as error-correcting codes, fault-tolerant networks, boolean circuit evaluation with faulty gates, and learning in the presence of errors. In the following chapters, we focus on the problem of searching in the presence of errors.

7.1 Introduction

Our goal is to find an unknown quantity $x$ in a previously specified, discrete, but not necessarily finite, domain by asking "yes-no" questions, when some questions are answered incorrectly. We show that it is possible to cope with errors whose number may grow linearly with the number of questions asked, and, depending on the class of questions allowed, to do so with an asymptotically optimal number of questions. Examining both adversarial and random errors, we find that even in a fairly restricted adversarial error model, searching is at least as difficult as in the random error model.

The problem can be further qualified by:

• Kinds of questions that may be asked.

- Comparison questions: "Is $x$ less than $y$?"

- Membership questions: "Is $x$ in the set $S$?", where $S$ is some subset of the domain.

• Kinds of errors possible.

- Constant number: It is known a priori that there will be at most $k$ errors, where $k$ is some fixed constant.

- Probabilistic: The answer to each question is erroneous independently with some probability $p$, $0 < p < \frac{1}{2}$.

- Linearly Bounded: For some constant $r$, $0 < r < \frac{1}{2}$, any initial sequence of $i$ answers has at most $ri$ errors. This model allows the answers to be erroneous in a malicious way. Unlikely scenarios in the probabilistic model (such as a long sequence of correct answers followed by a short sequence of false ones) must be dealt with here.

• Domain of the quantity being sought.

- Bounded: $x \in \{1, \ldots, n\}$, for some known $n$.

- Unbounded: $x$ may be any positive integer.
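The linearly bounded condition constrains every prefix of the answer sequence, not just the total. As a minimal sketch (the function name and the boolean encoding of lies are assumptions made for illustration), the following checks whether a given pattern of lies is permitted for a constant $r$:

```python
from fractions import Fraction


def satisfies_linear_bound(lies, r):
    """Return True if every prefix of i answers contains at most r * i lies.

    `lies` is a sequence of booleans, True where the answer was erroneous.
    """
    r = Fraction(r)
    count = 0
    for i, lied in enumerate(lies, start=1):
        count += int(lied)
        if count > r * i:
            return False
    return True


if __name__ == "__main__":
    r = Fraction(1, 3)
    print(satisfies_linear_bound([False, False, True], r))
    print(satisfies_linear_bound([True, False, False], r))
```

Under this definition an early lie is forbidden even when the overall fraction of lies is small, which is what distinguishes the model from a bound on the total number of errors.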

Much research has been devoted to the subject of searching in the presence of errors. Rivest et al. [25] have shown that in the bounded domain with at most $k$ errors, $x$ can be determined exactly with $\lg n + k \lg\lg n + O(k \lg k)$ comparison questions.^1 Here $k$ can be a function of $n$, but not of the number of questions asked. When $k$ is a constant, this is an asymptotically optimal bound since $\lceil \lg n \rceil$ is a lower bound on the number of questions needed to search even without errors. Naturally, this bound also applies to searching with membership questions, since comparison questions are a restricted version of membership questions.

In the probabilistic error model, where errors occur randomly and independently with probability $p$, we must find the correct $x$ with probability of failure at most $\delta$. Since $\delta$ is previously known and fixed, we consider $\delta$ a constant for the purpose of measuring the complexity of the searching algorithm.^2 Pelc [24] showed that in the probabilistic error model, with error probability $p < 1/2$, $O(\lg^2 n)$ questions are sufficient to search in the bounded domain. Frazier [13] improved the bound to $O(\lg n \lg\lg n)$ questions using a somewhat complicated analysis. Finally, using standard Chernoff bound techniques, Feige et al. [12] showed that $O(\lg n)$ questions are sufficient for any $p < 1/2$. Our contribution here is a formal reduction from the problem of searching in the probabilistic error model to that of searching in the linearly bounded error model. To state this result informally, we show that an algorithm for searching in the presence of linearly bounded errors can be transformed into an algorithm for searching in the presence of random errors. In this sense, searching with linearly bounded errors is at least as difficult as searching with random errors. When we are allowed to ask membership questions, this reduction together with the results from the linearly bounded error model mentioned below matches the Feige et al. $O(\lg n)$ bound in the bounded domain. We also generalize this bound to the unbounded domain.^3

^1 The term $\lg n$ denotes $\log_2 n$ throughout this thesis.

^2 Typically, the complexity of such algorithms depends on $\log(1/\delta)$, as does the complexity of our algorithm.

Figure 7.1: Bounds for searching in the bounded domain with linearly bounded errors. Here $n$ is a bound on the number being sought.

In the linearly bounded error model, Pelc [24] showed that $x$ can be determined exactly in $O(\lg n)$ questions in both the bounded and unbounded domains. However, these bounds only hold for $r < 1/3$. The best known bound using comparison or membership questions in the bounded domain for $1/3 \le r < 1/2$ was $O(n^{\lg\frac{1}{1-2r}})$. Note that the degree of the polynomial in this bound is unbounded as $r$ approaches 1/2. This bound comes from an analysis of a "brute-force" binary search, where each question of the search is asked enough times so that the correct answer can be determined by majority. A simple argument [13, 32] shows that the search problem cannot be solved (with either membership or comparison questions) if $r \ge 1/2$.

We show significantly improved bounds in the linearly bounded error model which hold for the entire range $0 < r < 1/2$. With membership questions, we show that $x$ can be determined exactly in $O(\lg n)$ questions in both the bounded and unbounded domains. These bounds are tight since searching has a trivial $\Omega(\lg n)$ lower bound. With comparison questions, we improve the bounds to $O(n^{\lg\frac{1}{1-r}}) = o(n)$ questions for the bounded domain and $O([n \lg n]^{\lg\frac{1}{1-r}}) = o(n)$ in the unbounded domain. A comparison of this work with the best known previous results can be found in Figures 7.1 and 7.2. Our results are obtained by looking at the search problem in the framework of chip games. These chip games have also proved useful in modeling a hypergraph 2-coloring problem [5]. In general, chip games model computational problems in such a way that winning strategies for the players translate into bounds on the critical resource. This critical resource is represented by some aspect of the chip game, such as number of chips used or number of moves in the game.

^3 In the unbounded domain, $n$ now refers to the unknown number.

Figure 7.2: Bounds for searching in the unbounded domain with linearly bounded errors. Here $n$ is the unknown number.

Spencer and Winkler [32] have also examined this problem. They have arrived independently at one of the theorems in this paper using different proof techniques. Their paper as well as one by Dhagat, Gács, and Winkler [10] considers another linearly bounded model of errors.

We begin in Section 7.2 by developing the framework of chip games within which we solve the search problem. Chapter 8 begins with a simple strategy for solving our problem in the linearly bounded model in the bounded domain which works with either comparison or membership questions, but whose obvious analysis gives an inefficient bound on the number of questions. We then improve this bound by analyzing this strategy using chip games. Chapter 8 continues by focusing on membership questions and proving an $O(\lg n)$ question bound for this class. The chapter ends with a generalization of the above bounds for the unbounded domain. Chapter 9 contains the aforementioned reduction between the probabilistic and linearly bounded error models in the bounded domain, and the $O(\lg n)$ question bound for the probabilistic error model which follows from it. These results are also generalized to the unbounded domain. Chapter 10 concludes the paper with a summary of the results and mention of some open problems.

Figure 7.3: Chip Game

7.2 Searching and Games 

Searching for an unknown number $x$ in $\{1, \ldots, n\}$ by asking "yes-no" questions can be restated in terms of the game of "Twenty Questions". In this game between two players, whom we denote Paul and Carole,^4 Carole thinks of a number between 1 and $n$. Paul must guess this number after asking some number of "yes-no" questions which is previously fixed. Our goal in this game is to determine how many questions Paul must be allowed in order for him to have a winning strategy in the game. Clearly, $\lceil \lg n \rceil$ questions are sufficient if Carole always answers truthfully. The problem of searching with errors thus translates into playing "Twenty Questions" with a liar [33]. Corresponding to the aforementioned error models, we consider both a probabilistic and an adversarial linearly bounded liar.

^4 An anagram for the word "oracle," as this is her role in the game.

The game against a linearly bounded liar can now be further reformulated as a Chip Game between two players: the Pusher and the Chooser. Pusher-Chooser games were first used by Spencer [31] to solve a different problem in his notes on the probabilistic method. The Chip Game starts with a unidimensional board marked in levels from 0 on upwards (see Figure 7.3). We start with $n$ chips on level 0, each chip representing one number in $\{1, \ldots, n\}$. At each step, the Pusher selects some chips from the board. These chips correspond to the subset $S$ of $\{1, \ldots, n\}$ that Paul wants to ask about. In other words, selecting $S$ is tantamount to asking "Is $x \in S$?". The Chooser then either moves the set of chips picked by the Pusher to the next level, indicating a "no" answer from Carole ($x$ is not in $S$), or it moves the set of chips not picked by the Pusher, indicating a "yes" answer from Carole. Therefore, a chip representing the number $y$ is moved to the right if and only if Carole says that $y$ is not the answer. The presence of a chip representing the number $y$ at level $i$ says that if $y$ is the unknown number $x$, then there have been $i$ lies in the game. After some $k$ steps, if a chip is at any level greater than $\lfloor rk \rfloor$, then it may be thrown away since the corresponding number cannot possibly be the answer (too many lies will have occurred). To win, the Pusher must eliminate all but one chip from the board.

To clarify which chips may be thrown away, we maintain a boundary line on the board. After $k$ steps, the boundary line will be at level $\lfloor rk \rfloor$. Thus the Pusher may dispose of the chips at levels to the right of the boundary line at any time. Note that the boundary line moves one level to the right after approximately $1/r$ steps. The number of questions that we need to ask to determine $x$ exactly is the same as the number of steps needed for the Pusher to win the above Chip Game.
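The bookkeeping of the Chip Game can be sketched in a few lines. The class below is an illustrative model only: the class name, method name, and dictionary representation are assumptions, not the thesis's notation. It tracks the level of each chip, moves the chips contradicted by each answer, and discards chips beyond the boundary line at level $\lfloor rk \rfloor$.

```python
from fractions import Fraction
from math import floor


class ChipGame:
    """Hypothetical minimal model of the Pusher-Chooser Chip Game."""

    def __init__(self, n, r):
        self.r = Fraction(r)
        # level[y] = number of lies told so far if y were the unknown number
        self.level = {y: 0 for y in range(1, n + 1)}
        self.steps = 0

    def step(self, picked, answer_is_yes):
        """Apply the question "Is x in picked?" with the Chooser's answer."""
        self.steps += 1
        for y in self.level:
            # "yes" moves the chips not picked; "no" moves the chips picked
            if (y in picked) != answer_is_yes:
                self.level[y] += 1
        boundary = floor(self.r * self.steps)
        # chips to the right of the boundary line can be thrown away
        self.level = {y: lv for y, lv in self.level.items() if lv <= boundary}
        return boundary


if __name__ == "__main__":
    game = ChipGame(4, Fraction(1, 3))
    game.step({1, 2}, True)   # Carole says x is in {1, 2}
    game.step({1}, False)     # Carole says x is not in {1}
    print(sorted(game.level))
```

With $r = 1/3$ the boundary line stays at level 0 for the first two steps, so in this short run every contradicted chip is discarded immediately.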



Chapter 8 



The Linearly Bounded Error Model 



In this chapter, we show an $O(n^{\lg\frac{1}{1-r}})$ question bound for searching with comparison questions and an $O(\lg n)$ question bound for searching with membership questions. We first show an $\Omega(n^{\lg\frac{1}{1-2r}})$ lower bound for a "brute-force" strategy. Strategies similar to this "brute-force" method are given by Pelc [24] and Frazier [13], and these were the best known results for $1/3 \le r < 1/2$ prior to this work.

8.1 A Brute-Force Strategy 

To determine an unknown number $x \in \{1, \ldots, n\}$, a "brute-force" strategy simply performs a binary search, repeating each question enough times so that majority gives the correct answer. Let $q(i)$ be the number of times question $i$ is repeated, and let $Q(i)$ be the total number of queries through question $i$ ($Q(i) = \sum_{j=1}^{i} q(j)$). To guarantee that majority gives the correct answer, we ensure that the number of lies the malicious oracle can tell is less than half the number of times question $k$ is repeated. We thus obtain the following:

$$\begin{aligned} r(Q(k-1) + q(k)) &< q(k)/2 \\ q(k) &> \frac{2r}{1-2r}\, Q(k-1) \end{aligned}$$

We now use the fact that $Q(k) = Q(k-1) + q(k)$:

$$\begin{aligned} Q(k) &= Q(k-1) + q(k) \\ &> \frac{1}{1-2r}\, Q(k-1) \end{aligned}$$

Thus, $Q(k) = \Omega\left(\left(\frac{1}{1-2r}\right)^k\right)$. Since the correct answers to $\lceil \lg n \rceil$ binary search questions must be obtained, we obtain the following lower bound for the "brute-force" strategy:

$$\begin{aligned} \Omega\left(\left(\tfrac{1}{1-2r}\right)^{\lceil \lg n \rceil}\right) &= \Omega\left(\left(\tfrac{1}{1-2r}\right)^{\lg n}\right) \\ &= \Omega\left(n^{\lg\frac{1}{1-2r}}\right) \end{aligned}$$

A similar upper bound can also be shown for this strategy. While this strategy is sound, its naive analysis yields an unsatisfactory bound. We can improve on this significantly through the use of chip games.
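The geometric growth of $Q(k)$ is easy to observe numerically. The sketch below chooses, for each question, the smallest integer $q(k)$ satisfying $r(Q(k-1) + q(k)) < q(k)/2$; taking the smallest such value is an assumption made for illustration, and the function name is not from the thesis.

```python
from fractions import Fraction
from math import floor


def brute_force_totals(r, num_questions):
    """Cumulative query counts Q(1), ..., Q(num_questions) for the brute-force search."""
    r = Fraction(r)
    ratio = 2 * r / (1 - 2 * r)   # q(k) must strictly exceed ratio * Q(k-1)
    totals = []
    asked = 0
    for _ in range(num_questions):
        repeats = floor(ratio * asked) + 1   # smallest integer strictly above the threshold
        asked += repeats
        totals.append(asked)
    return totals


if __name__ == "__main__":
    print(brute_force_totals(Fraction(1, 4), 6))
    print(brute_force_totals(Fraction(2, 5), 6))
```

For $r = 1/4$ the totals roughly double with each question, matching the factor $\frac{1}{1-2r} = 2$; the growth factor increases without bound as $r$ approaches 1/2.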

8.2 Searching with Comparison Questions 

We now consider an essentially identical strategy in the chip game. The Pusher plays this game in phases. Each phase corresponds to getting the correct answer to a single question. At the beginning of each phase there is a single stack of chips somewhere on the board. The Pusher picks a subset of these chips, generally some half of them which corresponds to a comparison question whose correct answer he wishes to determine. He continues picking the same half-stack throughout this phase until either it or the other half-stack moves beyond the boundary line. Then he begins the next phase with the remaining half-stack. This continues until there is only one chip remaining on the board to the left of the boundary line. Note that if there are $m$ chips on the board initially, $\lceil \lg m \rceil$ questions need to be answered correctly.

Now consider the board before and after some phase $j$. At the beginning of phase $j$, there is a stack of chips at a level some distance $l_j$ away from the boundary line (see Figure 8.1). At the end of phase $j$, one half-stack has moved some distance $i$ from its original position and the other half-stack has moved one position past the boundary line. The boundary line is now at some distance $l_{j+1}$ from the first half-stack.

Figure 8.1: Chips before and after phase $j$.

Let $T(d, l)$ be the number of steps the Pusher takes to have $d$ questions answered correctly when a single stack of chips on the board is a distance $l$ away from the boundary line. We then have

$$T(d, l_j) = T(d-1, l_{j+1}) + (\text{steps during phase } j).$$

We next bound the number of steps in phase $j$ as a function of $r$, $l_j$ and $l_{j+1}$.

Lemma 13 The total number of steps during phase $j$ is less than $\frac{2l_j - l_{j+1} + 3}{1-2r}$.

Proof: The total number of steps during phase $j$ is equal to the total number of levels the two half-stacks move. One half-stack moves $i$ levels and the other moves $i + l_{j+1} + 1$ levels, and thus the total number of steps in phase $j$ is $2i + l_{j+1} + 1$.

Let $s_j$ be the total number of steps prior to phase $j$, and let $p_j$ be the position of the boundary line prior to phase $j$. Note that $p_j = \lfloor r s_j \rfloor$. Using the fact that $x - 1 < \lfloor x \rfloor \le x$, we have the following:

$$\begin{aligned} p_{j+1} &\le r s_{j+1} \\ p_j &> r s_j - 1 \end{aligned}$$

Subtracting these two inequalities, we obtain:

$$p_{j+1} - p_j < r(s_{j+1} - s_j) + 1$$

Now, $s_{j+1} - s_j$ is the total number of steps during phase $j$, and $p_{j+1} - p_j$ is the number of levels the boundary line moves. We therefore have the following inequality

$$i + l_{j+1} - l_j < r(2i + l_{j+1} + 1) + 1 \qquad (8.1)$$

which implies that

$$i < \frac{l_j - (1-r)\, l_{j+1} + 1 + r}{1-2r}.$$

Substituting this bound on $i$ into the expression $2i + l_{j+1} + 1$, we obtain the desired result. □
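The final substitution in this proof is routine algebra, and it can be confirmed with exact rational arithmetic. The sketch below (function names are illustrative) evaluates $2i + l_{j+1} + 1$ at the upper bound on $i$ implied by Inequality 8.1 and compares it with the expression in Lemma 13.

```python
from fractions import Fraction


def steps_at_i_bound(r, l_j, l_next):
    """2i + l_{j+1} + 1 evaluated at the upper bound on i from Inequality 8.1."""
    r = Fraction(r)
    i_bound = (l_j - (1 - r) * l_next + 1 + r) / (1 - 2 * r)
    return 2 * i_bound + l_next + 1


def lemma13_bound(r, l_j, l_next):
    """The bound stated in Lemma 13: (2 l_j - l_{j+1} + 3) / (1 - 2r)."""
    r = Fraction(r)
    return (2 * l_j - l_next + 3) / (1 - 2 * r)


if __name__ == "__main__":
    print(steps_at_i_bound(Fraction(1, 4), 3, 2), lemma13_bound(Fraction(1, 4), 3, 2))
```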
Now we are ready to show that:

Theorem 22
$$T(d, l_0) = \begin{cases} O\left(l_0 \left(\frac{1}{1-r}\right)^d\right) & \text{if } l_0 > 0 \\ O\left(\left(\frac{1}{1-r}\right)^d\right) & \text{if } l_0 = 0 \end{cases}$$

Proof: We first show that $\forall j$, $l_{j+1} < \frac{l_j + 1 + r}{1-r}$. Solving Inequality 8.1 for $l_{j+1}$, we obtain:

$$\begin{aligned} l_{j+1} &< \frac{l_j + 1 + r + i(2r - 1)}{1-r} \\ &\le \frac{l_j + 1 + r}{1-r} \end{aligned}$$

This last inequality holds due to the fact that $2r - 1 < 0$ and $i \ge 0$. Successively applying this inequality, we obtain the following:

$$l_j < \frac{l_{j-1} + 1 + r}{1-r} < \cdots < \frac{l_0}{(1-r)^j} + (1+r) \sum_{i=1}^{j} \frac{1}{(1-r)^i}$$

We can now use this fact to obtain a bound on $T(d, l_0)$:

$$\begin{aligned} T(d, l_0) &\le T(d-1, l_1) + \frac{2l_0 - l_1 + 3}{1-2r} \\ &\le T(d-2, l_2) + \frac{2l_0 - l_1 + 3}{1-2r} + \frac{2l_1 - l_2 + 3}{1-2r} \\ &\ \ \vdots \\ &\le \frac{2l_0 - l_1 + 3}{1-2r} + \frac{2l_1 - l_2 + 3}{1-2r} + \cdots + \frac{2l_{d-1} - l_d + 3}{1-2r} \end{aligned}$$

For any constant $r$, this expression is $O\left(l_0 \left(\frac{1}{1-r}\right)^d\right)$ if $l_0 > 0$ or $O\left(\left(\frac{1}{1-r}\right)^d\right)$ if $l_0 = 0$. □

This result will be used throughout this thesis. In particular, consider the problem of searching in the bounded domain with comparison questions. The corresponding chip game begins with $n$ chips and the boundary line at level 0. Using binary search, we require the correct answers to $\lceil \lg n \rceil$ questions. Employing Theorem 22, we immediately obtain:

Theorem 23 The problem of searching in the linearly bounded error model in the bounded domain $\{1, \ldots, n\}$ with comparison questions and error constant $r$, $0 < r < 1/2$, can be solved with $O(n^{\lg\frac{1}{1-r}})$ questions.

8.3 Searching with Membership Questions 

We show a winning strategy for the Pusher which requires $O(\lg n)$ steps. The strategy works in three stages. In Stage 1, the Pusher eliminates all but $O(\lg n)$ chips from the board in $O(\lg n)$ steps. In Stage 2, the Pusher eliminates all but $O(1)$ chips from the board in an additional $O(\lg n)$ steps. In Stage 3, the Pusher removes all but one chip from the board in the final $O(\lg n)$ steps.

8.3.1 Stage 1

The strategy employed during Stage 1 is simple. We describe it inductively on the number of steps as follows. Let $h_m(i)$ be the height of the stack of chips at level $i$ after $m$ steps. In the $(m+1)$-st step, the Pusher picks $\lfloor h_m(i)/2 \rfloor$ chips from each stack of chips at all levels $i$. He continues this way for $c_1 \lg n$ steps (where $c_1$ is a constant that will be determined in the analysis).

Before we can analyze this strategy, we will need a few definitions. Define normalized binomial coefficients $b_m(i)$ as

$$b_m(i) = \frac{n}{2^m} \binom{m}{i}$$

and let

$$\Delta_m(i) = h_m(i) - b_m(i).$$



The normalized binomial coefficient $b_m(i)$ will approximate $h_m(i)$, the height of the stack at level $i$ after $m$ steps, while $\Delta_m(i)$ will account for any discrepancy.

In order to analyze the given strategy, we need to be able to determine the number of chips which are to the left of the boundary line after some number of steps in our strategy. After $m$ steps, this is equivalent to $\sum_{i \le \lfloor rm \rfloor} h_m(i)$ (since $r$ is the rate at which the boundary line moves). This sum is difficult to determine exactly. Instead, we will derive an upper bound for it by using the fact that $\sum_{i \le \lfloor rm \rfloor} h_m(i) = \sum_{i \le \lfloor rm \rfloor} b_m(i) + \sum_{i \le \lfloor rm \rfloor} \Delta_m(i)$. In particular, we will show an upper bound for $\sum_{i \le \lfloor rm \rfloor} \Delta_m(i)$.

For the strategy given above, we now bound the discrepancy between the actual number of chips in any initial set of $j$ stacks and the number of chips predicted by the normalized binomial coefficients. We will need three lemmas. The first two lemmas handle boundary conditions, while the third is required in the proof of the main theorem.

Lemma 14 $(\forall m \ge 0)$, $\Delta_m(0) \le 1$.

Proof: The proof is by induction on $m$.

• base case: For $m = 0$, $h_0(0) = n = b_0(0) \Rightarrow \Delta_0(0) = 0$.

• inductive step: Assume $\Delta_{m-1}(0) \le 1$. We now have the following:

$$\begin{aligned} h_m(0) &\le \left\lceil \frac{h_{m-1}(0)}{2} \right\rceil \\ &\le \frac{b_{m-1}(0) + \Delta_{m-1}(0)}{2} + \frac{1}{2} \\ &\le b_m(0) + 1 \end{aligned}$$

Thus, $\Delta_m(0) = h_m(0) - b_m(0) \le 1$. □

Lemma 15 $(\forall m \ge 0)$, $\sum_{i=0}^{m} \Delta_m(i) = 0$.

Proof:

$$\sum_{i=0}^{m} h_m(i) = n = \sum_{i=0}^{m} b_m(i) \ \Rightarrow\ \sum_{i=0}^{m} \Delta_m(i) = 0 \qquad □$$
Proof: We first note the fact that $\binom{a}{b} = \binom{a-1}{b} + \binom{a-1}{b-1}$. The proof proceeds as follows:
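Theorem 24 $(\forall m \ge 0)\,(\forall j \le m)$, $\sum_{i=0}^{j} \Delta_m(i) \le j + 1$.

Proof: The proof of the theorem is by induction on $m$. The base case of $m = 0$ is trivial. In the inductive step, we show that if the theorem holds for $m - 1$, then the theorem holds for $m$. The boundary conditions $j = 0$ and $j = m$ are handled by Lemmas 14 and 15. We concentrate on the case $0 < j < m$ below. Consider the following (see Figure 8.2):

$$\begin{aligned} \sum_{i=0}^{j} h_m(i) &\le \sum_{i=0}^{j-1} h_{m-1}(i) + \left\lceil \frac{h_{m-1}(j)}{2} \right\rceil \\ &\le \sum_{i=0}^{j-1} h_{m-1}(i) + \frac{h_{m-1}(j)}{2} + \frac{1}{2} \\ &= \sum_{i=0}^{j-1} b_{m-1}(i) + \frac{b_{m-1}(j)}{2} + \sum_{i=0}^{j-1} \Delta_{m-1}(i) + \frac{\Delta_{m-1}(j)}{2} + \frac{1}{2} \\ &= \sum_{i=0}^{j} b_m(i) + \sum_{i=0}^{j-1} \Delta_{m-1}(i) + \frac{\Delta_{m-1}(j)}{2} + \frac{1}{2} \end{aligned}$$

Figure 8.2: Chips before and after step $m$. (Levels $1, 2, 3, \ldots, j-1, j$ are shown; the legend distinguishes chips that move from chips that do not move, and shaded chips are the same in both panels.)

We now bound the quantity $\sum_{i=0}^{j-1} \Delta_{m-1}(i) + \frac{\Delta_{m-1}(j)}{2} + \frac{1}{2}$. There are two cases, depending upon whether $\Delta_{m-1}(j) \le 1$ or $\Delta_{m-1}(j) > 1$. If $\Delta_{m-1}(j) \le 1$, we have the following:

$$\begin{aligned} \sum_{i=0}^{j-1} \Delta_{m-1}(i) + \frac{\Delta_{m-1}(j)}{2} + \frac{1}{2} &\le \sum_{i=0}^{j-1} \Delta_{m-1}(i) + 1 \\ &\le j + 1 \end{aligned}$$

If $\Delta_{m-1}(j) > 1$, then $\frac{\Delta_{m-1}(j)}{2} + \frac{1}{2} < \Delta_{m-1}(j)$. We thus obtain the following:

$$\begin{aligned} \sum_{i=0}^{j-1} \Delta_{m-1}(i) + \frac{\Delta_{m-1}(j)}{2} + \frac{1}{2} &< \sum_{i=0}^{j} \Delta_{m-1}(i) \\ &\le j + 1 \end{aligned}$$

We therefore have

$$\begin{aligned} \sum_{i=0}^{j} h_m(i) &\le \sum_{i=0}^{j} b_m(i) + \sum_{i=0}^{j-1} \Delta_{m-1}(i) + \frac{\Delta_{m-1}(j)}{2} + \frac{1}{2} \\ &\le \sum_{i=0}^{j} b_m(i) + j + 1 \end{aligned}$$

which implies that $\sum_{i=0}^{j} \Delta_m(i) \le j + 1$. □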

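Now we will bound $\sum_{i=0}^{\lfloor rm \rfloor} b_m(i)$. We will find a constant $c_1$ such that for $m = c_1 \lg n$,
$\sum_{i=0}^{\lfloor rm \rfloor} b_m(i)$ is a constant. If we can do this, then it follows from the theorem above that
$\sum_{i=0}^{\lfloor rm \rfloor} h_m(i)$, the number of chips remaining to the left of the boundary line, is $O(\lg n)$. The
reasoning is as follows:

$$\begin{aligned}
\sum_{i=0}^{\lfloor rm \rfloor} h_m(i) &= \sum_{i=0}^{\lfloor rm \rfloor} b_m(i) + \sum_{i=0}^{\lfloor rm \rfloor} \Delta_m(i) \\
&< \sum_{i=0}^{\lfloor rm \rfloor} b_m(i) + \lfloor rm \rfloor + 1 \\
&= c_2 + \lfloor r \cdot c_1 \lg n \rfloor \\
&< c_3 \lg n
\end{aligned}$$

for appropriate constants $c_2$ and $c_3$.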

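In order to determine $c_1$, we make use of the following bound [22]:

$$\sum_{i=0}^{\lfloor \mu m \rfloor} \binom{m}{i} \le 2^{m H(\mu)}$$

where $0 < \mu < 1/2$ and $H(\mu)$ is the binary entropy function.$^1$ We now have:

$$\begin{aligned}
\sum_{i=0}^{\lfloor rm \rfloor} b_m(i) &= \frac{n}{2^m} \sum_{i=0}^{\lfloor rm \rfloor} \binom{m}{i} \\
&\le \frac{n \, 2^{m H(r)}}{2^m} \\
&= n \, 2^{m(H(r) - 1)}
\end{aligned}$$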

This last quantity is $O(1)$ when $m = \frac{\lg n}{1 - H(r)}$. Thus if we pick $c_1 = \frac{1}{1 - H(r)}$, then after
$m = c_1 \lg n$ steps, there will be at most $c_3 \lg n$ chips remaining on the board to the left of the
boundary line. The strategy in this stage can also be applied to the game where the boundary
line starts out at level $\ell = O(\lg n)$ instead of at $\ell = 0$. One can show directly or through the
use of the techniques given in Sections 9.1.1 and 9.1.2 that Stage 1 still ends in $O(\lg n)$ steps
with at most $O(\lg n)$ chips to the left of the boundary line. This fact will be useful when we
examine the unbounded domain.

$^1$ $H(r) = -r \lg r - (1 - r) \lg(1 - r)$
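
The choice of $c_1$ above can be checked numerically. The following is a minimal sketch, assuming $b_m(i) = n \binom{m}{i} / 2^m$ as in the displayed sum; the function names are hypothetical and chosen only for this illustration.

```python
import math

def binary_entropy(r: float) -> float:
    """H(r) = -r lg r - (1 - r) lg(1 - r), for 0 < r < 1."""
    return -r * math.log2(r) - (1 - r) * math.log2(1 - r)

def binomial_mass_left(n: int, m: int, r: float) -> float:
    """Sum of b_m(i) for i = 0 .. floor(r m), with b_m(i) = n * C(m, i) / 2^m."""
    cutoff = math.floor(r * m)
    return n * sum(math.comb(m, i) for i in range(cutoff + 1)) / 2 ** m

def entropy_bound(n: int, m: int, r: float) -> float:
    """The upper bound n * 2^(m (H(r) - 1)) derived in the text."""
    return n * 2 ** (m * (binary_entropy(r) - 1))

def stage1_steps(n: int, r: float) -> int:
    """m = ceil(c_1 lg n) with c_1 = 1 / (1 - H(r))."""
    return math.ceil(math.log2(n) / (1 - binary_entropy(r)))
```

For example, `stage1_steps(2**10, 0.25)` gives the number of Stage 1 steps after which `entropy_bound` drops to at most 1, so the binomial mass to the left of the boundary line is a constant.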

8.3.2 Stage 2 

At the end of Stage 1 we are left with some $c_2 \lg n$ chips on the board with the boundary line
at level $c_1 \lg n$ (for appropriate constants $c_1$ and $c_2$). Stage 2 takes $O(\lg n)$ additional steps,
after which there are $O(1)$ chips on the board to the left of the boundary line.

Before starting Stage 2, we alter the board by moving everything on the board (chips and
boundary line) to the right by $c_2 \lg n$, so that the boundary line is now at level $(c_1 + c_2) \lg n =
c \lg n$. While this new board corresponds to a different game than the one we have played until
now (it corresponds to a game in which many more questions and lies have occurred), these
two boards are equivalent in the sense that the Pusher can win from the first board within $k$
extra moves if and only if he can win from the second board within $k$ extra moves.

Now move the chips to the left in such a way that there is exactly one chip on each of the
first $c_2 \lg n$ levels. Note that the Pusher does not help himself by doing this, since moving chips
to the left is in effect ignoring potential lies which he has discovered.

At each step in this stage, the Pusher first orders the chips from left to right, ordering chips
on the same level arbitrarily. He then picks every other chip according to this order; that is, he
picks the 1st, 3rd, 5th, ... chips. We say that the board is in a nice state if no level has more
than two chips.

Lemma 17 Throughout Stage 2, the board is in a nice state. 

Proof: We show this by induction on the number of steps in Stage 2. Certainly at the beginning
of Stage 2, the board is in a nice state since each level is occupied by at most one chip. Now
suppose that the board is in a nice state after $i$ steps. Consider any level $j$ after the $(i + 1)$-st
step. Since both levels $j - 1$ and $j$ had at most two chips before the $(i + 1)$-st step, after this
step level $j$ retains at most one chip and gains at most one chip, thus ending with at most two
chips. □




We now show that after $O(\lg n)$ steps, there are at most $2k$ chips remaining to the left of
the boundary line. Here $k$ is a constant (depending only on $r$) which will be determined later.
If there are fewer than $2k$ chips to the left of the boundary line, Stage 2 terminates. Let the weight
of a chip be the level it is on, and let the weight of the board be the total weight of its $2k$ leftmost
chips.

Lemma 18 After each step in Stage 2, the weight of the board increases by at least $k - 1$.

Proof: Of the $2k$ leftmost chips after step $i$, at least $(2k - 1)$ chips remain in the set of leftmost
$2k$ chips after step $i + 1$. (The $2k$-th chip may be on the same level as the $(2k + 1)$-st chip. In
this case, if the $2k$-th chip moves in step $i + 1$, then the $(2k + 1)$-st chip becomes the new $2k$-th
chip in the revised ordering.) At least $\lfloor \frac{2k - 1}{2} \rfloor = k - 1$ of these chips move to the right one level
during step $i + 1$, thus increasing the weight of the board by at least $k - 1$. □
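
Lemmas 17 and 18 can be exercised with a small simulation. This is a sketch under stated assumptions: the Chooser is modelled as a coin flip deciding whether the picked or the unpicked chips move one level to the right, the boundary line is ignored, and all function names are hypothetical.

```python
import random

def stage2_step(levels, move_picked):
    """One Stage 2 step. The Pusher picks the 1st, 3rd, 5th, ... chips in
    left-to-right order; the Chooser decides which half moves right."""
    ordered = sorted(levels)
    result = []
    for position, level in enumerate(ordered):
        picked = position % 2 == 0  # 1st, 3rd, 5th, ... chips
        result.append(level + 1 if picked == move_picked else level)
    return result

def is_nice(levels):
    """A board is in a nice state if no level holds more than two chips."""
    return all(levels.count(level) <= 2 for level in set(levels))

def board_weight(levels, k):
    """Total weight (sum of levels) of the 2k leftmost chips."""
    return sum(sorted(levels)[: 2 * k])

def simulate(num_chips, k, steps, seed=0):
    """Start with one chip on each of the first num_chips levels and play
    `steps` rounds against a random Chooser (an assumption of this sketch)."""
    rng = random.Random(seed)
    levels = list(range(num_chips))
    gains, always_nice = [], True
    for _ in range(steps):
        before = board_weight(levels, k)
        levels = stage2_step(levels, rng.random() < 0.5)
        always_nice = always_nice and is_nice(levels)
        gains.append(board_weight(levels, k) - before)
    return levels, gains, always_nice
```

In a run such as `simulate(40, 3, 200)`, the board should stay nice throughout and every entry of `gains` should be at least `k - 1`, matching the two lemmas.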

Let $S$ be the number of steps taken during this stage and let $W$ be the weight of the board
at the end of these $S$ steps. Since the weight of the board goes up by at least $k - 1$ in each step,
and since the initial weight of the board was non-negative, $W \ge (k - 1)S$. At the end of the $S$
steps, the boundary line is at $c \lg n + \lfloor rS \rfloor$. Since this stage ends when there are fewer than $2k$
chips to the left of the boundary line, we certainly have $W \le 2k(c \lg n + rS)$. Combining these
two inequalities, we obtain:

$$2k(c \lg n + rS) \ge S(k - 1)$$

$$S \le \frac{2kc}{k(1 - 2r) - 1} \lg n$$

If we let $k = \frac{2}{1 - 2r}$, then $S \le \frac{4c}{1 - 2r} \lg n = O(\lg n)$. Thus after $O(\lg n)$ steps, Stage 2 ends
leaving at most $2k$ chips to the left of the boundary line.

8.3.3 Stage 3 

At the beginning of Stage 3, the Pusher moves all of the remaining chips to level 0. Again this
is legal, since he is essentially choosing to ignore some information he has gathered. We now
have some $2k$ chips a distance $c \lg n$ away from the boundary line (for appropriate constants $c$
and $k$). By applying Theorem 22 from Section 8.2, the Pusher can win this game in
$O\left[(c \lg n) \cdot \left(\frac{1}{1-r}\right)^{\lceil \lg 2k \rceil}\right] = O(\lg n)$ steps. Since each of the three stages takes $O(\lg n)$ steps, we now have the
following:

Theorem 25 The problem of searching in the linearly bounded error model in the bounded
domain $\{1, \ldots, n\}$ with membership questions and error constant $r$, $0 < r < \frac{1}{2}$, can be solved
with $O(\lg n)$ questions.

8.4 Unbounded Search 

Now consider the problem of searching for a positive integer in the presence of errors as before,
but where no upper bound on its size is known. Let this unknown integer be $n$. Using strategies
developed in this paper already, we show that $n$ can be found with $O(\lg n)$ membership questions
and $O([n \lg^2 n]^{\lg \frac{1}{1-r}}) = o(n)$ comparison questions.

The search occurs in two stages. First, we determine a bound for the unknown number $n$.
Second, given a bound on $n$, we employ the techniques for bounded searching given above.

8.4.1 Unbounded Search with Membership Questions 

Consider the problem of bounding the unknown number $n$ if all of the answers we receive are
known to be correct. We could ask questions of the form "Is $x < 2^{2^i}$?". We would begin by
asking "Is $x < 2^{2^1}$?". If the answer were "no", we would follow with "Is $x < 2^{2^2}$?", and so on.
Since $n < 2^{2^{\lceil \lg \lg n \rceil}}$, we will obtain our first "yes" answer (and thus have a bound on $n$) after at
most $\lceil \lg \lg n \rceil$ questions. We further note that our bound is not too large:

$$2^{2^{\lceil \lg \lg n \rceil}} < 2^{2^{\lg \lg n + 1}} = 2^{2 \lg n} = n^2$$
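
The error-free bounding strategy can be sketched as follows. The starting index $i = 1$ is an assumption made for this illustration, and `bound_unknown` is a hypothetical name; every answer is taken to be truthful.

```python
def bound_unknown(n: int) -> tuple[int, int]:
    """Ask "Is x < 2^(2^i)?" for i = 1, 2, ... with truthful answers.

    Returns (bound, number_of_questions), where bound is the first
    double power 2^(2^i) that exceeds the unknown number n.
    """
    i = 1
    while not n < 2 ** (2 ** i):  # a truthful "no": try the next double power
        i += 1
    return 2 ** (2 ** i), i
```

For every $n \ge 2$ the returned bound exceeds $n$ and is at most $n^2$, which is the property the bounded search in the next stage relies on.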

Employing the techniques and results of Section 8.2, we can use the above strategy in the
presence of errors. We need the correct answers to $\lceil \lg \lg n \rceil$ questions. By Theorem 22, we can
obtain these answers in

$$O\left(\left(\frac{1}{1-r}\right)^{\lceil \lg \lg n \rceil}\right) = O\left((\lg n)^{\lg \frac{1}{1-r}}\right) = o(\lg n)$$

questions.

Having found a bound for $n$, we have reduced our unbounded search problem to a bounded
search problem. We can now apply our bounded search strategy of Section 8.3. It is important
to note that since we have already asked $o(\lg n)$ questions, the boundary line will have moved
to $o(\lg n)$. But recall that Stage 1 of our bounded search algorithm can tolerate the boundary
line starting at $O(\lg n)$. Thus the Pusher can now start with all relevant chips at level 0 and
boundary line at level $o(\lg n)$ and apply the bounded search strategy of Section 8.3. Since our
bound on the unknown number $n$ is at most $n^2$, we will finish this stage after $O(\lg(n^2)) = O(\lg n)$
questions. We can now claim the following:

Theorem 26 The problem of searching in the linearly bounded error model in the unbounded
domain with membership questions and error constant $r$, $0 < r < \frac{1}{2}$, can be solved with $O(\lg n)$
questions, where the number being sought is $n$.

8.4.2 Unbounded Searching with Comparison Questions 

We can employ techniques similar to those used above to solve the unbounded search problem
using comparison questions. We first determine a bound on the unknown integer $n$ using
the strategy developed above. We thus bound the unknown number $n$ by at most $n^2$ using
$O((\lg n)^{\lg \frac{1}{1-r}})$ questions. Note that the boundary line will now be at $O((\lg n)^{\lg \frac{1}{1-r}})$.

Having bounded the unknown number $n$ by at most $n^2$, we could simply use Theorem 22
directly. By performing a simple binary search, we will need correct answers to at most $\lceil \lg(n^2) \rceil$
questions. Using Theorem 22, we obtain an overall question bound of

$$O\left((\lg n)^{\lg \frac{1}{1-r}} \cdot \left(\frac{1}{1-r}\right)^{\lceil \lg(n^2) \rceil}\right) = O\left([n^2 \lg n]^{\lg \frac{1}{1-r}}\right).$$

This can be improved, however, by adding an extra stage. After bounding the unknown
number $n$ by at most $n^2$, partition this bounded interval into exponentially growing subintervals
$I_j = [2^j, 2^{j+1} - 1]$ $\forall j \ge 0$. Note that there will be at most $\lceil \lg(n^2) \rceil$ such subintervals. To
determine the correct subinterval, we perform a simple binary search on these subintervals
requiring correct answers to $\lceil \lg \lceil \lg(n^2) \rceil \rceil$ questions. By Theorem 22, we will need

$$O\left((\lg n)^{\lg \frac{1}{1-r}} \cdot \left(\frac{1}{1-r}\right)^{\lceil \lg \lceil \lg(n^2) \rceil \rceil}\right) = O\left([\lg^2 n]^{\lg \frac{1}{1-r}}\right)$$

additional questions. Since our subintervals grew exponentially, the subinterval containing the
unknown number $n$ will be of size at most $n$. We can thus perform a final binary search on this
subinterval and employ Theorem 22 to obtain an overall question bound of

$$O\left([\lg^2 n]^{\lg \frac{1}{1-r}} \cdot \left(\frac{1}{1-r}\right)^{\lceil \lg n \rceil}\right) = O\left([n \lg^2 n]^{\lg \frac{1}{1-r}}\right) = o(n).$$
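
The extra stage can be illustrated in the error-free setting. The sketch below assumes truthful comparison answers and a known bound (for example $n^2$); `find_subinterval` is a hypothetical helper that binary-searches over the subinterval indices $j$.

```python
def find_subinterval(n: int, bound: int) -> tuple[int, int]:
    """Binary search over indices j of I_j = [2^j, 2^(j+1) - 1].

    Uses truthful answers to comparison questions "Is x < 2^(mid+1)?".
    Returns (j, number_of_questions) with 2^j <= n <= 2^(j+1) - 1.
    Requires 1 <= n <= bound.
    """
    lo, hi = 0, bound.bit_length() - 1  # candidate indices j
    questions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if n < 2 ** (mid + 1):  # "yes": n lies in I_j for some j <= mid
            hi = mid
        else:                   # "no": n lies in I_j for some j > mid
            lo = mid + 1
    return lo, questions
```

Since the located subinterval $I_j$ contains $2^j \le n$ integers, the final binary search runs over at most $n$ candidates rather than $n^2$.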

Theorem 27 The problem of searching in the linearly bounded error model in the unbounded
domain with comparison questions and error constant $r$, $0 < r < \frac{1}{2}$, can be solved with
$O([n \lg^2 n]^{\lg \frac{1}{1-r}}) = o(n)$ questions, where the number being sought is $n$.



Chapter 9 



The Probabilistic Error Model 



Recall that in the probabilistic error model, Carole lies randomly and independently with
probability $p$, and Paul must determine the unknown number $x$ correctly with probability at
least $1 - \delta$, for a given $\delta > 0$. In this chapter we give a reduction to show that searching in
the probabilistic error model is no more difficult than searching in the linearly bounded model.
Formally, we show that if $A_\ell$ is an algorithm which, given $n$ and $r$, solves the linearly bounded
error problem in $f(n, r)$ questions, then we can construct an algorithm $A_p$ which solves the
probabilistic error problem in $f(cn, \frac{1+2p}{4})$ questions where $c$ is a constant depending on $p$ and $\delta$.
An $O(\log n)$ bound for the probabilistic error model with membership questions follows easily
from the results of the previous chapter. We also generalize our results to the unbounded
domain.

9.1 The Reduction 

The terms "algorithm" and "strategy" will be used somewhat interchangeably, since a winning
strategy for the Pusher in the Chip Game corresponds to an algorithm to solve the search
problem with errors, and vice versa.

We now construct an algorithm which solves the probabilistic error problem from an algorithm
which solves the linearly bounded error problem. Let $A_\ell$ be an algorithm which solves
the linearly bounded error problem. $A_\ell$ requires values for $n$ and $r$, as well as access to an
oracle whose errors are linearly bounded (an oracle which gives at most $ri$ errors to any initial
sequence of $i$ questions). Algorithm $A_\ell$ will ask $f(n, r)$ questions and will return the correct
element $x \in \{1, \ldots, n\}$ with certainty. Let $A_p$ be an algorithm which solves the probabilistic
error problem. $A_p$ requires values for $n$, $p$ and $\delta$, as well as access to an oracle whose errors are
random (an oracle which lies randomly and independently with probability $p$). Algorithm $A_p$
will ask $g(n, p, \delta)$ questions and will return the correct element $x \in \{1, \ldots, n\}$ with probability
at least $1 - \delta$.

In order to solve a probabilistic error problem with a linearly bounded error algorithm, we 
must ensure that the errors made by the probabilistic oracle fall within those allowed by the
linearly bounded error algorithm (with high probability). One method to accomplish this is to 
set r > p. This assures that in the long run, with high probability, the number of lies told by the 
probabilistic oracle will be fewer than the number of lies the linearly bounded error algorithm 
can tolerate. The danger here lies at the beginning of the game where it is relatively likely 
that too many lies will be told, and hence the correct chip will be thrown out by the linearly 
bounded error algorithm. To overcome this difficulty, we must prevent the linearly bounded 
error algorithm from throwing out the correct chip in this critical stage. 

We proceed in two stages. In the first stage, we play a modified game with excess chips in 
such a way as to guarantee that the correct chip is not thrown out until at least m questions 
have been asked of the probabilistic oracle. In the second stage, we find the correct chip among 
those remaining with high probability. 

9.1.1 Stage 1 

We begin the game by setting $r$ midway between $p$ and 1/2. Thus, $r = \frac{1/2 + p}{2} = \frac{1+2p}{4}$. This
ensures that the number of errors given by the probabilistic oracle will be fewer than the number
of errors which can be tolerated by the linearly bounded error algorithm in the long run with
high probability. To ensure that the correct chip does not cross the boundary line before the
$m$-th step, we begin the game with $n 2^{\frac{1-r}{r} m}$ chips.

In the first critical $\frac{1-r}{r} m$ steps, we intercept algorithm $A_\ell$'s queries to the oracle and answer
them so as to maximize the number of chips which are left at level 0. We first note that after
these $\frac{1-r}{r} m$ steps, the boundary line will be at $(1 - r)m$. Second, since the number of chips at
level 0 is reduced by at most half in each step, there will be at least $n$ chips remaining after
these $\frac{1-r}{r} m$ steps. See Figure 9.1.

Figure 9.1: Stages of the Reduction. (In the first $(1-r)m/r$ steps we answer the questions; in the following $m$ steps the oracle answers the questions.)



9.1.2 Stage 2 

Associated with each chip in the current game is an element of the set $\{1, \ldots, n 2^{\frac{1-r}{r} m}\}$. For
a chip $u$, let this be the OldValue($u$). We now establish a new correspondence between $n$
of the remaining chips at level 0 and the set $\{1, \ldots, n\}$. For a chip $u$, this will be
NewValue($u$). This new correspondence is order-preserving in the following sense: for chips $u$
and $v$, OldValue($u$) < OldValue($v$) iff NewValue($u$) < NewValue($v$). The necessity for
establishing an order-preserving correspondence stems from the need to have this reduction apply to
the searching problem where only comparison questions are allowed. We now continue running
algorithm $A_\ell$, sending its queries to the probabilistic oracle after translating them thus: Let
$C = \{u : A_\ell \text{ picks } u \text{ and NewValue}(u) \in \{1, \ldots, n\}\}$. That is, $C$ is the set of selected chips
which have defined NewValues. Let $S_C = \{\text{NewValue}(u) : u \in C\}$, that is, $S_C$ is the set of
associated NewValues. If $S_C \ne \emptyset$, then we ask the probabilistic oracle about $S_C$ and return the
oracle's answer to $A_\ell$. If $S_C = \emptyset$, then we could ourselves immediately answer "no". However,
it is more convenient in the analysis to have the probabilistic oracle answer all questions in
this stage. Thus, when $S_C = \emptyset$, we ask the probabilistic oracle about $\{1, \ldots, n\}$, and return
the opposite of its answer to $A_\ell$. Suppose that $A_\ell$ finishes and returns chip $u$. We then return
NewValue($u$), or "fail" if NewValue($u$) is not defined.
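
The query translation of Stage 2 can be sketched in a few lines. Chips are identified here with their OldValues, the choice of the $n$ smallest surviving chips is an arbitrary assumption of this sketch (any $n$ survivors would do), and `oracle` is a hypothetical callable that answers a membership question about a set.

```python
def new_value_map(remaining_old_values, n):
    """Order-preserving NewValue assignment for n of the surviving chips.

    Returns a dict OldValue -> NewValue in {1, ..., n}. Chips outside the
    dict have no defined NewValue.
    """
    chosen = sorted(remaining_old_values)[:n]
    return {old: new for new, old in enumerate(chosen, start=1)}

def translate_query(picked_old_values, new_value, n, oracle):
    """Translate one membership query of the linearly bounded algorithm.

    picked_old_values: chips picked by the algorithm (as OldValues).
    oracle: callable taking a set of NewValues and returning True/False.
    """
    s_c = {new_value[u] for u in picked_old_values if u in new_value}
    if s_c:
        return oracle(s_c)
    # S_C is empty: ask about the whole domain and return the opposite answer.
    return not oracle(set(range(1, n + 1)))
```

With a truthful oracle for $x = 3$, a query that picks only chips without a NewValue is answered "no", exactly as if we had answered it ourselves.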

9.1.3 The Analysis

We now claim that for an appropriate $m$, the above procedure will terminate with the correct
value with probability at least $1 - \delta$. We in fact show that the probability that the "correct
chip" ever crosses the boundary line is at most $\delta$. If the correct chip never crosses the boundary
line, then the linearly bounded error algorithm must return the correct chip when it terminates,
and hence the correct answer will be obtained.

Our analysis makes use of Hoeffding's Inequality [18] to approximate the tail of a binomial
distribution. Let $GE(p, m, n)$ be the probability of at least $n$ successes in $m$ Bernoulli trials,
where each trial has probability of success $p$. Hoeffding's Inequality can then be stated as
follows:

$$GE(p, m, (p + \alpha)m) \le e^{-2\alpha^2 m}$$
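
The inequality can be checked against the exact binomial tail for small cases. This is only an illustrative sketch; `ge` and `hoeffding_bound` are hypothetical names, and the tail is computed by direct summation.

```python
import math

def ge(p: float, m: int, k: int) -> float:
    """GE(p, m, k): probability of at least k successes in m Bernoulli(p) trials."""
    k = max(k, 0)
    return sum(
        math.comb(m, s) * p ** s * (1 - p) ** (m - s) for s in range(k, m + 1)
    )

def hoeffding_bound(alpha: float, m: int) -> float:
    """Right-hand side e^(-2 alpha^2 m) of Hoeffding's Inequality."""
    return math.exp(-2 * alpha ** 2 * m)

def tail_threshold(p: float, alpha: float, m: int) -> int:
    """Smallest integer count that is at least (p + alpha) m.

    A tiny slack guards against floating-point round-up of exact products.
    """
    return math.ceil((p + alpha) * m - 1e-9)
```

For instance, `ge(0.3, 10, tail_threshold(0.3, 0.2, 10))` is the exact probability of at least 5 lies in 10 answers, which stays below `hoeffding_bound(0.2, 10)`.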

After the first $\frac{1-r}{r} m$ steps, the correct chip will be at level 0, and the boundary line will be
at level $(1 - r)m$ (see Figure 9.1). Since a chip can move at most one level per question and
the boundary line moves at a rate $r$, none of the $n$ remaining chips at level 0 will cross the
boundary line until at least $m$ questions have been asked of the probabilistic oracle. For any
$j \ge m$, the probability that the correct chip is past the boundary line after $j$ questions have
been asked of the oracle is given by $GE(p, j, jr + (1 - r)m)$. The probability that the correct
chip is ever past the boundary line is therefore at most

$$\sum_{j=m}^{\infty} GE(p, j, jr + (1 - r)m)$$



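Given that:

• If $n \ge n'$ then $GE(p, m, n) \le GE(p, m, n')$

• $r = \frac{1+2p}{4} = p + \frac{1-2p}{4}$

we can apply Hoeffding's Inequality:

$$\begin{aligned}
\sum_{j=m}^{\infty} GE(p, j, jr + (1 - r)m) &\le \sum_{j=m}^{\infty} GE(p, j, jr) \\
&= \sum_{j=m}^{\infty} GE\left(p, j, \left(p + \frac{1-2p}{4}\right) j\right) \\
&\le \sum_{j=m}^{\infty} e^{-2\left(\frac{1-2p}{4}\right)^2 j}
\end{aligned}$$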



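Since we would like this sum to be at most $\delta$, we can now solve for $m$. Let $\gamma = e^{-2\left(\frac{1-2p}{4}\right)^2}$. Note
that $\gamma < 1$.

$$\sum_{j=m}^{\infty} \gamma^j \le \delta$$

$$\frac{\gamma^m}{1 - \gamma} \le \delta$$

$$m \ge \frac{\ln \frac{1}{\delta} + \ln \frac{1}{1-\gamma}}{\ln \frac{1}{\gamma}}$$

For $\gamma = e^{-2\left(\frac{1-2p}{4}\right)^2} = e^{-\frac{(1-2p)^2}{8}}$, we obtain

$$m \ge \frac{8 \ln \frac{1}{\delta} + 8 \ln\left[1 / \left(1 - e^{-\frac{(1-2p)^2}{8}}\right)\right]}{(1 - 2p)^2}$$

Noting that for $0 \le p < \frac{1}{2}$, $0 < \frac{(1-2p)^2}{8} \le \frac{1}{8}$, we can use the fact that for $0 < x \le \frac{1}{8}$, $1 - e^{-x} \ge \frac{15}{16} x$
to pick

$$m = \frac{8 \ln \frac{1}{\delta} + 8 \ln\left[\frac{128/15}{(1-2p)^2}\right]}{(1 - 2p)^2}$$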
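
The parameter choices of the reduction can be computed directly. The following sketch rounds $m$ and the number of intercepted steps up to integers (an assumption made here for concreteness); the function names are hypothetical.

```python
import math

def reduction_parameters(p: float, delta: float):
    """Return (r, m, warmup_steps) for the reduction.

    r            : error constant midway between p and 1/2.
    m            : number of oracle questions before the correct chip can cross.
    warmup_steps : the (1 - r) m / r intercepted steps, rounded up.
    """
    r = (1 + 2 * p) / 4
    gap = (1 - 2 * p) ** 2
    m = math.ceil((8 * math.log(1 / delta) + 8 * math.log((128 / 15) / gap)) / gap)
    warmup_steps = math.ceil((1 - r) / r * m)
    return r, m, warmup_steps

def failure_bound(p: float, m: int) -> float:
    """Geometric tail gamma^m / (1 - gamma) with gamma = e^(-(1 - 2p)^2 / 8)."""
    gamma = math.exp(-((1 - 2 * p) ** 2) / 8)
    return gamma ** m / (1 - gamma)
```

With these values, `failure_bound(p, m)` is at most `delta`, so the correct chip crosses the boundary line with probability at most $\delta$. The initial number of chips is then $n \cdot 2^{\text{warmup\_steps}}$.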

We can now conclude the following theorem: 

Theorem 28 Let $A_\ell$ be a linearly bounded error algorithm which requires $f(n, r)$ questions.
Then $A_\ell$ can be used to solve a probabilistic error problem specified by $n$, $p$, and $\delta$ in $f(cn, \frac{1+2p}{4})$
questions where $c = 2^{\frac{3-2p}{1+2p} m}$ and $m$ is as given above.

An $O(\log n)$ bound for the probabilistic error model with membership questions now easily
follows from the results of the previous section.

Theorem 29 The problem of searching for an unknown element $x \in \{1, \ldots, n\}$ with confidence
probability $\delta$ in the presence of random errors (occurring randomly and independently with fixed
probability $p < 1/2$) can be solved with $O(\log n)$ membership questions. The dependence of these
bounds on $p$ and $\delta$ is polynomial in $\frac{1}{1-2p}$ and logarithmic in $1/\delta$.

9.2 The Unbounded Domain

We now consider the problem of searching for an unknown integer in the presence of random
errors where no bound on the unknown number is known. Let this unknown integer be $n$. Our
strategy proceeds in two stages. In the first stage, we obtain a bound for the integer $n$. In the
second stage, we apply our techniques for searching in the bounded domain given above. To
ensure that our overall procedure fails with probability at most $\delta$, we require that each of these
two stages fails with probability at most $\delta' = \delta/2$.

9.2.1 Stage 1 

By obtaining the correct answers to $\lceil \lg \lg n \rceil$ questions of the form "Is $x < 2^{2^i}$?" as in
Section 8.4.1, we can bound the unknown number $n$ by at most $n^2$.

We might now imagine determining the correct answers to these $\lceil \lg \lg n \rceil$ questions by asking
each one sufficiently often so that majority is incorrect with some sufficiently small probability.
Unfortunately, to determine how much error is "sufficiently small" requires that we know the
value of $n$. Since $n$ is unknown here, we will require a more subtle querying algorithm.

To ensure that our procedure fails with probability at most $\delta'$, we require that the correct
answer to question $i$ is obtained with error at most $\delta'/2^i$. Consider asking the $i$-th question $m(i)$
consecutive times and taking the majority vote of the responses to be the "correct" answer.
The probability that our posited answer is incorrect can be calculated as follows:

$$\begin{aligned}
\Pr[\text{majority vote is wrong}] &= \Pr[\text{at least half errors}] \\
&= GE(p, m(i), m(i)/2) \\
&= GE(p, m(i), (p + [1/2 - p]) \cdot m(i)) \\
&\le e^{-2(1/2 - p)^2 m(i)} \\
&= e^{-\frac{(1-2p)^2 m(i)}{2}}
\end{aligned}$$

Since we require this probability to be at most $\delta'/2^i$, we obtain the following:

$$m(i) \ge \frac{\ln 4}{(1 - 2p)^2} \left[ \lg \frac{1}{\delta'} + i \right]$$



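Now, since our procedure will terminate (with probability at least $1 - \delta'$) after the correct
answers to $\lceil \lg \lg n \rceil$ questions have been obtained, we arrive at an overall question bound of

$$\begin{aligned}
\sum_{i=1}^{\lceil \lg \lg n \rceil} m(i) &= \sum_{i=1}^{\lceil \lg \lg n \rceil} \frac{\ln 4}{(1 - 2p)^2} \left[ \lg \frac{1}{\delta'} + i \right] \\
&= \frac{\ln 4}{(1 - 2p)^2} \left[ \lceil \lg \lg n \rceil \lg \frac{1}{\delta'} + \frac{\lceil \lg \lg n \rceil (\lceil \lg \lg n \rceil + 1)}{2} \right] \\
&= O(\lceil \lg \lg n \rceil^2)
\end{aligned}$$

We thus bound the unknown number $n$ by at most $n^2$ using $O(\lceil \lg \lg n \rceil^2)$ (comparison) questions.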
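
The repetition schedule $m(i)$ can be tabulated as follows. This is a sketch in which $m(i)$ is rounded up to an integer (an assumption made here), with hypothetical function names; `total_stage1_questions` assumes $n > 2$ so that $\lceil \lg \lg n \rceil \ge 1$.

```python
import math

def repetitions(i: int, p: float, delta_prime: float) -> int:
    """m(i): how many times to ask the i-th question before a majority vote."""
    gap = (1 - 2 * p) ** 2
    return math.ceil(math.log(4) / gap * (math.log2(1 / delta_prime) + i))

def majority_error_bound(p: float, reps: int) -> float:
    """Hoeffding bound e^(-(1 - 2p)^2 reps / 2) on a wrong majority vote."""
    return math.exp(-((1 - 2 * p) ** 2) * reps / 2)

def total_stage1_questions(n: int, p: float, delta_prime: float) -> int:
    """Sum of m(i) for i = 1 .. ceil(lg lg n); assumes n > 2."""
    rounds = math.ceil(math.log2(math.log2(n)))
    return sum(repetitions(i, p, delta_prime) for i in range(1, rounds + 1))
```

Because the allowed error $\delta'/2^i$ halves with each question, the per-question failure probabilities sum to at most $\delta'$ regardless of how many questions turn out to be needed, which is why $n$ need not be known in advance.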



9.2.2 Stage 2 

We can now simply apply the bounded searching techniques for membership questions described
in the previous section or the bounds of Feige et al. [12] for comparison questions. We can thus
obtain the correct answer (with high probability) in an additional $O(\lg n^2) = O(\lg n)$ comparison
or membership questions. Thus, we can conclude the following theorem:

Theorem 30 The problem of searching for an unknown element $n$ in the unbounded domain
of all positive integers with confidence probability $\delta$ in the presence of random errors (occurring
independently with fixed probability $p < 1/2$) can be solved with $O(\lg n)$ comparison or
membership questions. The dependence of these bounds on $p$ and $\delta$ is polynomial in $\frac{1}{1-2p}$ and
logarithmic in $1/\delta$.

Conclusions and Open Questions



We have examined the problem of searching in a discrete domain under two different error 
models: the linearly bounded error model and the probabilistic error model. 

In the linearly bounded error model, we have shown that $O(\lg n)$ membership questions are
sufficient to search in both the bounded and unbounded domains. With comparison questions,
we show bounds of $O(n^{\lg \frac{1}{1-r}})$ and $O([n \lg^2 n]^{\lg \frac{1}{1-r}})$ in the bounded and unbounded domains,
respectively.

Our reduction from the probabilistic to the linearly bounded error model shows that the
searching problem is at least as difficult to solve in the linearly bounded error model as in the
probabilistic error model. This gives evidence that the linearly bounded error model deserves
further investigation. A corollary of this reduction gives another proof of the $O(\lg n)$ bound on
membership questions required to search with probabilistic errors. Previously known bounds
are also extended to the unbounded domain.

Two questions arise directly from this work: 

1. In the linearly bounded error model, can we show a logarithmic upper bound on the 
number of comparison questions required when the error rate is between 1/3 and 1/2? 
Using techniques similar to ours, Borgstrom and Kosaraju [8] have recently shown that 
this is the case. 

2. Can a strict inequality be shown between the probabilistic and linearly bounded models
with respect to the problem of searching? That is, can it be shown that searching in the
presence of linearly bounded errors with some question class requires an asymptotically
greater number of questions than searching in the presence of random errors with the
same question class? This problem remains open.